Re: [GENERAL] Removing duplicate records from a bulk upload (rationale behind selecting a method)

2014-12-12 Thread Marc Mamin

>Thanks Tom,
>I understand that the rationale behind choosing to create a new table from
>distinct records is that, since both approaches need full table scans,
>selecting distinct records is faster (and seems more straightforward) than
>finding/deleting duplicates;

Hi,
on a large table you may get this done faster by using more than one thread, e.g.:

select a,b,c into newtable_0 from oldtable where a%8 = 0 group by a,b,c;
select a,b,c into newtable_1 from oldtable where a%8 = 1 group by a,b,c;
...
select a,b,c into newtable_7 from oldtable where a%8 = 7 group by a,b,c;

(each session writes to its own target table, since a second SELECT INTO the
same table would fail; the partial results can be combined afterwards)

This will/should use a shared full table scan on oldtable.
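The reason the split is safe: two rows with the same (a,b,c) necessarily share the same a, so they land in the same a%8 partition and get de-duplicated together. A toy sketch outside the database (plain Python with made-up rows, nothing PostgreSQL-specific) of that property:

```python
from itertools import chain

# Hypothetical rows (a, b, c); duplicates included on purpose.
rows = [(1, "x", 10), (1, "x", 10), (2, "y", 20), (9, "x", 10), (2, "y", 20)]

# Global de-duplication in one pass, as "GROUP BY a, b, c" would do.
global_dedup = set(rows)

# Partitioned de-duplication: one "worker" per remainder of a % 8.
partitions = [{r for r in rows if r[0] % 8 == k} for k in range(8)]
partitioned_dedup = set(chain.from_iterable(partitions))

# Identical results: equal keys always fall into the same partition.
assert partitioned_dedup == global_dedup
print(sorted(partitioned_dedup))  # [(1, 'x', 10), (2, 'y', 20), (9, 'x', 10)]
```

Note the argument would fail if the partitioning expression were not a function of the grouped columns.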

HTH

Marc Mamin


>
>Best regards,
>Daniel
>
>-Original Message-
>From: pgsql-general-ow...@postgresql.org
>[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Tom Lane
>Sent: December-08-14 21:52
>To: Scott Marlowe
>Cc: Andy Colson; Daniel Begin; pgsql-general@postgresql.org
>Subject: Re: [GENERAL] Removing duplicate records from a bulk upload
>(rationale behind selecting a method)
>
>Scott Marlowe  writes:
>> If you're de-duping a whole table, no need to create indexes, as it's 
>> gonna have to hit every row anyway. Fastest way I've found has been:
>
>> select a,b,c into newtable from oldtable group by a,b,c;
>
>> One pass, done.
>
>> If you want to use less than the whole row, you can use select 
>> distinct on (col1, col2) * into newtable from oldtable;
>
>Also, the DISTINCT ON method can be refined to control which of a set of
>duplicate keys is retained, if you can identify additional columns that
>constitute a preference order for retaining/discarding dupes.  See the
>"latest weather reports" example in the SELECT reference page.
>
>In any case, it's advisable to crank up work_mem while performing this
>operation.
>
>regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] anyone using oid2name?

2014-12-12 Thread Peter Eisentraut
pgsql-hackers are discussing some housekeeping in contrib.

Is anyone using the oid2name tool?

Otherwise, we might deprecate and eventually remove it.




Re: [GENERAL] anyone using oid2name?

2014-12-12 Thread Bruce Momjian
On Fri, Dec 12, 2014 at 09:18:53AM -0500, Peter Eisentraut wrote:
> pgsql-hackers are discussing some housekeeping in contrib.
> 
> Is anyone using the oid2name tool?
> 
> Otherwise, we might deprecate and eventually remove it.

Uh, if we remove it, what tool does someone use from the command-line to
find the objects represented by files?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [GENERAL] anyone using oid2name?

2014-12-12 Thread Atri Sharma
On Friday, December 12, 2014, Bruce Momjian  wrote:

> On Fri, Dec 12, 2014 at 09:18:53AM -0500, Peter Eisentraut wrote:
> > pgsql-hackers are discussing some housekeeping in contrib.
> >
> > Is anyone using the oid2name tool?
> >
> > Otherwise, we might deprecate and eventually remove it.
>
> Uh, if we remove it, what tool does someone use from the command-line to
> find the objects represented by files?
>
>
>

+1

Removing a convenience tool without a stable replacement doesn't seem wise.

Regards,

Atri


-- 
Regards,

Atri
*l'apprenant*


Re: Fwd: Fwd: [GENERAL] Problem with pg_dump and decimal mark

2014-12-12 Thread Eric Svenson
Hi Adrian,

so finally I have a workaround which is OK for me. When I separate the
tables and the data (using the -a and -s switches of pg_dump) into 2 SQL
backup files, everything works OK on the problem VM.

I will try to investigate further in the coming weeks; I'm on holiday next week.

Regards and thanks for your support,
Eric

2014-12-10 15:27 GMT+01:00 Adrian Klaver :
>
> On 12/10/2014 01:32 AM, Eric Svenson wrote:
>
>> So, one more success...
>>
>> I have taken a part of the backup SQL file which fills the table
>>
>> COPY dev_my_settings (.) from stdin;
>> 12345  text   text   0   123.345345
>>
>> This file ALONE works! (without changing ANYTHING!)
>>
>
> Hmm, almost like the encoding/locale is changing in the complete file.
>
>
>> So if I run the first (huge) SQL file and then the second, which fills
>> the dev_my_settings table, everything is ok.
>>
>
> FYI if you do a pg_dump using the custom format(-Fc) you have more control
> over the restore. You can run pg_restore -l against the dump file to get a
> Table of Contents(TOC), which you can edit by commenting out items you do
> not want to restore and then feed back to pg_restore via -L. You also have
> the option to 'restore' all or part of the custom file to a text file using
> the -f option. The details can be found here:
>
> http://www.postgresql.org/docs/9.3/interactive/app-pgrestore.html
>
>
>> 2014-12-10 10:23 GMT+01:00 Eric Svenson >
>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


Re: [GENERAL] anyone using oid2name?

2014-12-12 Thread Peter Eisentraut
On 12/12/14 9:25 AM, Bruce Momjian wrote:
> On Fri, Dec 12, 2014 at 09:18:53AM -0500, Peter Eisentraut wrote:
>> pgsql-hackers are discussing some housekeeping in contrib.
>>
>> Is anyone using the oid2name tool?
>>
>> Otherwise, we might deprecate and eventually remove it.
> 
> Uh, if we remove it, what tool does someone use from the command-line to
> find the objects represented by files?

I don't know.  I want to find out what people are using it for.

One option would be to tweak some psql backslash commands to show the
information.  Most of the functionality is already there.





Re: [GENERAL] anyone using oid2name?

2014-12-12 Thread Bruce Momjian
On Fri, Dec 12, 2014 at 09:42:48AM -0500, Peter Eisentraut wrote:
> On 12/12/14 9:25 AM, Bruce Momjian wrote:
> > On Fri, Dec 12, 2014 at 09:18:53AM -0500, Peter Eisentraut wrote:
> >> pgsql-hackers are discussing some housekeeping in contrib.
> >>
> >> Is anyone using the oid2name tool?
> >>
> >> Otherwise, we might deprecate and eventually remove it.
> > 
> > Uh, if we remove it, what tool does someone use from the command-line to
> > find the objects represented by files?
> 
> I don't know.  I want to find out what people are using it for.
> 
> One option would be to tweak some psql backslash commands to show the
> information.  Most of the functionality is already there.

True.  It might be possible to just document a psql -c command to do it.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [GENERAL] Removing duplicate records from a bulk upload (rationale behind selecting a method)

2014-12-12 Thread Daniel Begin
Seems promising, but could you provide a reference to the PostgreSQL
documentation regarding this "a%8=*" feature?
Best,

Daniel

-Original Message-
From: pgsql-general-ow...@postgresql.org
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Marc Mamin
Sent: December-12-14 06:41
To: Daniel Begin; 'Tom Lane'; 'Scott Marlowe'
Cc: 'Andy Colson'; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Removing duplicate records from a bulk upload
(rationale behind selecting a method)


>Thank Tom,
>I understand that the rationale behind choosing to create a new table 
>from distinct records is that, since both approaches need full table 
>scans, selecting distinct records is faster (and seems more straight 
>forward) than finding/deleting duplicates;

Hi,
on a large table you may get it faster while using more than one thread.
e.g.:

select a,b,c into newtable from oldtable where a%8 =0 group by a,b,c;
select a,b,c into newtable from oldtable where a%8 =1 group by a,b,c;
...
select a,b,c into newtable from oldtable where a%8 =7 group by a,b,c;

This will/should use a shared full table scan on oldtable.

HTH

Marc Mamin




Re: [GENERAL] Removing duplicate records from a bulk upload (rationale behind selecting a method)

2014-12-12 Thread John McKown
On Fri, Dec 12, 2014 at 9:57 AM, Daniel Begin  wrote:

> Seems promising but could you provide me a reference to PostgreSQL
> documentation regarding this "a%8=*" feature?
> Best
>

% is the modulus operator.

Assuming "a" is an integer (I don't remember), then doing 8 selects of "a
modulus 8", one for each of the possible results (0..7), will each select
about 1/8 of the entire table (I would guess), and put together they will
end up selecting all of the original table. I don't know, myself, why this
would be faster. But I'm not any kind of a PostgreSQL expert either.
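To make the arithmetic concrete, here is a small Python sketch (made-up integer keys, nothing PostgreSQL-specific) showing that the eight per-remainder selections are disjoint and together cover every row exactly once:

```python
# Hypothetical integer keys; duplicates included on purpose.
keys = list(range(100)) + [7, 42, 42]

# Eight per-remainder "queries" over the same data.
buckets = {k: [x for x in keys if x % 8 == k] for k in range(8)}

# The 8 selections are disjoint and together cover every row exactly once.
total = sum(len(b) for b in buckets.values())
assert total == len(keys)
print(total)  # 103
```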





-- 
While a transcendent vocabulary is laudable, one must be eternally careful
so that the calculated objective of communication does not become ensconced
in obscurity.  In other words, eschew obfuscation.

Maranatha! <><
John McKown


Re: [GENERAL] anyone using oid2name?

2014-12-12 Thread Jeff Janes
On Fri, Dec 12, 2014 at 6:18 AM, Peter Eisentraut  wrote:

> pgsql-hackers are discussing some housekeeping in contrib.
>
> Is anyone using the oid2name tool?
>
> Otherwise, we might deprecate and eventually remove it.
>

I use it occasionally (but really dislike it) during testing and debugging,
to help decipher what is going on.  It is a chain of information flow from
strace to lsof to oid2name.

I could probably use pg_class instead if I could ever remember what the
difference was between oid and relfilenode.

Cheers,

Jeff


Re: [GENERAL] anyone using oid2name?

2014-12-12 Thread Bruce Momjian
On Fri, Dec 12, 2014 at 09:18:01AM -0800, Jeff Janes wrote:
> On Fri, Dec 12, 2014 at 6:18 AM, Peter Eisentraut  wrote:
> 
> pgsql-hackers are discussing some housekeeping in contrib.
> 
> Is anyone using the oid2name tool?
> 
> Otherwise, we might deprecate and eventually remove it.
> 
> 
> I use it occasionally (but really dislike it) during testing and debugging, to
> help decipher what is going on.  It is a chain of information flow from strace
> to lsof to oid2name.
> 
> I could probably use pg_class instead if I could ever remember what the
> difference was between oid and relfilenode.

The fact oid2name is called that, but references relfilenode, doesn't
help my memory either.  ;-)

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




[GENERAL] Re: Removing duplicate records from a bulk upload (rationale behind selecting a method)

2014-12-12 Thread David G Johnston
John McKown wrote
> I don't
> know, myself, why this would be faster. But I'm not any kind of a
> PostgreSQL expert either.

It is faster because PostgreSQL does not have native parallelism.  By using
a%n in a WHERE clause you can start n separate sessions, choose a different
remainder for each one, and manually introduce parallelism into the activity.

Though, given this is likely to be I/O-constrained, the possible gains do not
scale linearly with the number of sessions - which themselves effectively max
out at the number of cores available to the server.
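The "one session per remainder" pattern can be sketched with threads standing in for database sessions (plain Python, hypothetical data; as noted above, real gains depend on I/O and core count):

```python
from concurrent.futures import ThreadPoolExecutor

N = 4  # number of "sessions"; in practice capped by cores and I/O
rows = [(i % 10, i % 3) for i in range(1000)]  # hypothetical data, many dupes

def dedup_partition(k):
    # Stand-in for one session running:
    #   SELECT a, b INTO newtable_k FROM oldtable WHERE a % N = k GROUP BY a, b;
    return {r for r in rows if r[0] % N == k}

with ThreadPoolExecutor(max_workers=N) as pool:
    parts = list(pool.map(dedup_partition, range(N)))

# Merging the per-session results gives the same answer as one global pass.
result = set().union(*parts)
assert result == set(rows)
print(len(result))  # 30 distinct (a, b) pairs
```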

David J.






Re: [GENERAL] Removing duplicate records from a bulk upload (rationale behind selecting a method)

2014-12-12 Thread Marc Mamin

>>Seems promising but could you provide me a reference to PostgreSQL
>>documentation regarding this "a%8=*" feature?
>>Best

>
> % is the modulus operator.
>
>Assuming "a" is an integer (I don't remember), then doing 8 selects of "a
>modulus 8", one for each of the possible results (0..7), will each select
>about 1/8 of the entire table (I would guess), and put together they will
>end up selecting all of the original table. I don't know, myself, why this
>would be faster. But I'm not any kind of a PostgreSQL expert either.


yes.
Extracting unique values from very large sets is not for free,
neither from the I/O nor from the cpu point of view
spreading the tasks on more CPUs should reduce I/O.

(Does your huge table fit in RAM ?)

If you don't have int values available for the % operator, you may also consider
(hashtext(value))/%, but the extra work may result in no benefit.
hashtext is not documented as it is not garanteed to stay stables in future 
Postgres release,
but is safe in such a case  (http://lmgtfy.com/?q=postgres+hashtext+partition).
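The same trick for a text key, sketched in Python with hashlib.md5 standing in for hashtext (an assumption for illustration only; the property that matters is just that equal values always map to the same bucket):

```python
import hashlib

def bucket(value: str, n: int = 8) -> int:
    # Stand-in for PostgreSQL's hashtext(value) % n: any stable hash works,
    # as long as every session uses the same function and the same n.
    h = int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")
    return h % n

values = ["alpha", "beta", "alpha", "gamma", "beta"]

# Equal values always hash to the same bucket, so per-bucket
# de-duplication still equals global de-duplication.
buckets = {k: {v for v in values if bucket(v) == k} for k in range(8)}
merged = set().union(*buckets.values())
assert merged == set(values)
print(sorted(merged))  # ['alpha', 'beta', 'gamma']
```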

Another point is that I'm not sure that all threads will join a shared scan
on a freshly created table where the visibility hint bit is not yet set:

   (see the head comment in http://doxygen.postgresql.org/syncscan_8c_source.html)

...because reading a table for the first time implies rewriting it:

http://www.cybertec.at/speeding-up-things-with-hint-bits/

You'll be able to avoid this extra I/O in the upcoming 9.4 thanks to the new
COPY FREEZE option:
http://www.postgresql.org/docs/9.4/interactive/sql-copy.html

Hmm, on the other hand, I suppose that you would avoid the extra I/O for the
hint bits if you first copied your data into a temp table, but then you won't
be able to parallelize the job, as other transactions won't see the data.
Moreover, you need to pay attention to how much work_mem you can afford for
each transaction, knowing you have x of them running concurrently.

So in the end I'm not sure if multiple threads will help here.
I'm using this approach for aggregations, which are more CPU-intensive than a
simple DISTINCT.

I'm looking forward to seeing your test results :)

Marc Mamin









Re: [GENERAL] pgxs question - linking c-functions to external libraries

2014-12-12 Thread Alan Nilsson
I recently had need to do the same thing and I am having no luck.  Admittedly, 
I am not too keen on the postgres build setup and have not debugged this 
extensively, but rather hoped there was an easy answer up front.  That said….

I am trying to link libuuid into a custom extension, here is my make file 
(building PG 9.3.5 on CentOS 6.5 (GCC 4.4.7) fwiw):

MODULES = aitpowerpg
EXTENSION = aitpowerpg
DATA = aitpowerpg--1.0.sql

SHLIB_LINK += -luuid

ifdef USE_PGXS
PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)
else
subdir = contrib/aitpowerpg
top_builddir = ../..
include $(top_builddir)/src/Makefile.global
include $(top_srcdir)/contrib/contrib-global.mk
endif

The line SHLIB_LINK += -luuid has no effect.  All of postgres builds & runs OK;
this custom module builds fine but won't load due to missing symbols, which
makes sense since ldd shows that libuuid is not linked in.

If I make clean, make, then just execute the last gcc invocation to link the 
extension then manually tack -luuid at the end, all is good.

Is there something simple I’m missing?

thanks
alan

> On Jun 19, 2013, at 6:57 AM, Peter Eisentraut  wrote:
> 
> On 6/6/13 11:49 PM, Rad Cirskis wrote:
>> Hi John,
>> have you managed to get it to link with external shared libs?
> 
> Sure, many extensions do that.  Do something like
> 
> SHLIB_LINK += -lfoo
> 
> in your Makefile.
> 
> 
> 
> -- 
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general



Re: [GENERAL] pgxs question - linking c-functions to external libraries

2014-12-12 Thread Tom Lane
Alan Nilsson  writes:
> I am trying to link libuuid into a custom extension, here is my make file 
> (building PG 9.3.5 on CentOS 6.5 (GCC 4.4.7) fwiw):

> MODULES = aitpowerpg
> EXTENSION = aitpowerpg
> DATA = aitpowerpg--1.0.sql

> SHLIB_LINK += -luuid

> ifdef USE_PGXS
> PG_CONFIG = pg_config
> PGXS := $(shell $(PG_CONFIG) --pgxs)
> include $(PGXS)
> else
> subdir = contrib/aitpowerpg
> top_builddir = ../..
> include $(top_builddir)/src/Makefile.global
> include $(top_srcdir)/contrib/contrib-global.mk
> endif

> The line SHLIB_LINK += -luuid has no effect.

I'm not sure how well documented this is, but pgxs.mk quoth

#   SHLIB_LINK -- will be added to MODULE_big link line

Since you're using the MODULES target not the MODULE_big target, it gets
left out.  Try following the pattern in the contrib/uuid-ossp Makefile.

regards, tom lane
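For reference, a minimal sketch of the MODULE_big pattern Tom points to (the extension names are taken from Alan's Makefile; the OBJS line assumes the source file is aitpowerpg.c, which is a guess):

```make
MODULE_big = aitpowerpg
OBJS = aitpowerpg.o

EXTENSION = aitpowerpg
DATA = aitpowerpg--1.0.sql

# With MODULE_big, pgxs adds SHLIB_LINK to the shared-library link line.
SHLIB_LINK += -luuid

ifdef USE_PGXS
PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)
else
subdir = contrib/aitpowerpg
top_builddir = ../..
include $(top_builddir)/src/Makefile.global
include $(top_srcdir)/contrib/contrib-global.mk
endif
```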




Re: [GENERAL] pgxs question - linking c-functions to external libraries

2014-12-12 Thread Alan Nilsson
Awesome - thanks Tom, works.

alan






Re: [GENERAL] Removing duplicate records from a bulk upload (rationale behind selecting a method)

2014-12-12 Thread Daniel Begin
Thanks Marc (and all others),

I knew that nothing was free, and understanding where the costs come from
gives me some rationale for making my choice!

However, I understand from your answer that there is no definitive approach
at this time (considering my specific context). Since the tables are quite
similar (number of fields and data types), I will probably explore some of
the promising approaches proposed so far (below) to get facts.

a) select distinct * into newtable from oldtable;
b) select a,b,c into newtable from oldtable group by a,b,c;
c) select distinct a,b,c into newtable from oldtable where a%6=* group by a,b,c;
d) select a,b,c into newtable from oldtable where a%6=* group by a,b,c;

c & d will be run after having set the visibility bit with a SELECT count.
Running on 8 cores, only 6 will be used (a%6=*).

Anything else to add/replace in the scenarios above (a-d)?

Before going further, here is some context you may wish to comment on:

- Most of the tables do not fit in my computer's 32GB of memory - they are
usually between 35GB and 350GB;
- All the tables have a bigint ID I can use with the modulo operator (id%6=*).

Working on a Windows platform:
- work_mem: 1024MB
- maintenance_work_mem: 16384MB (I understand the value must be significantly
larger than work_mem)
- shared_buffers: 128MB (I understand that on Windows the useful range is
64MB to 512MB)

Awaiting comments, and hoping to get back soon with useful results.

Daniel

From: Marc Mamin [mailto:m.ma...@intershop.de] 
Sent: December-12-14 14:25
To: John McKown; Daniel Begin
Cc: Tom Lane; Scott Marlowe; Andy Colson; PostgreSQL General
Subject: AW: [GENERAL] Removing duplicate records from a bulk upload
(rationale behind selecting a method)

 


[GENERAL] function indexes, index only scan and sorting

2014-12-12 Thread Jonathan Vanasco

Can someone confirm a suspicion for me ?

I have a moderately sized table (20+ columns, 3MM rows) that tracks "tags".

I have a lower(column) function index that is used to simplify case-insensitive
lookups.

CREATE INDEX idx_tag_name_lower ON tag(lower(name));

I have a few complex queries that need to join back to this table (via the `id` 
primary key) and sort on `lower(name)`.

I'm not selecting `lower(name)`, just using it for an order-by.

The only way I seem to be able to avoid a Sequential Scan and run an index-only 
scan is with another index -- this one specifically (and I've run queries 
against 8 index permutations):

CREATE INDEX idx_tag_joins ON tag(id, name_display); 

Am I correct in observing that the value of a function index can't be used for
sorting?

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Removing duplicate records from a bulk upload (rationale behind selecting a method)

2014-12-12 Thread Jonathan Vanasco

On Dec 8, 2014, at 9:35 PM, Scott Marlowe wrote:

> select a,b,c into newtable from oldtable group by a,b,c;
> 
> One pass, done.

This is a bit naive, but couldn't this approach potentially be faster 
(depending on the system)?

SELECT a, b, c INTO duplicate_records
FROM (
    SELECT a, b, c, count(*) AS counted
    FROM source_table
    GROUP BY a, b, c
) q_inner
WHERE q_inner.counted > 1;

DELETE FROM source_table
USING duplicate_records
WHERE source_table.a = duplicate_records.a
  AND source_table.b = duplicate_records.b
  AND source_table.c = duplicate_records.c;

It would require multiple full table scans, but it would minimize the writing 
to disk -- and isn't a 'read' operation usually much more efficient than a 
'write' operation?  If the duplicate checking is only done on a small subset of 
columns, indexes could speed things up too.
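A runnable model of this two-step idea, with SQLite standing in for
PostgreSQL (SQLite has no DELETE ... USING, so a correlated EXISTS replaces
it). Note that the final INSERT is an addition: the DELETE as quoted removes
every copy of a duplicated row, including the one you presumably want to
keep, so one copy of each is put back from duplicate_records:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_table (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany(
    "INSERT INTO source_table VALUES (?, ?, ?)",
    [(1, 1, 1), (1, 1, 1), (1, 1, 1), (2, 2, 2), (3, 3, 3), (3, 3, 3)],
)

# Step 1: collect one representative of each duplicated (a, b, c) combination.
conn.execute(
    "CREATE TABLE duplicate_records AS "
    "SELECT a, b, c FROM source_table GROUP BY a, b, c HAVING count(*) > 1"
)

# Step 2: delete every copy of the duplicated combinations.
# (Correlated EXISTS stands in for PostgreSQL's DELETE ... USING.)
conn.execute(
    "DELETE FROM source_table WHERE EXISTS ("
    "  SELECT 1 FROM duplicate_records d"
    "  WHERE d.a = source_table.a AND d.b = source_table.b"
    "    AND d.c = source_table.c)"
)

# Step 3 (added): restore one copy of each duplicated row, so the end state
# is one row per distinct combination, matching the GROUP BY approach.
conn.execute("INSERT INTO source_table SELECT a, b, c FROM duplicate_records")

remaining = conn.execute("SELECT count(*) FROM source_table").fetchone()[0]
print(remaining)  # 3 distinct rows: (1,1,1), (2,2,2), (3,3,3)
```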




-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] function indexes, index only scan and sorting

2014-12-12 Thread Tom Lane
Jonathan Vanasco  writes:
> Am I correct in observing that the value of a function index can't be used 
> for sorting ?

No ...

regression=# create table tt (f1 int, f2 text);
CREATE TABLE
regression=# create index on tt (lower(f2));
CREATE INDEX
regression=# explain select * from tt order by lower(f2);
 QUERY PLAN 
----------------------------------------------------------------------------
 Index Scan using tt_lower_idx on tt  (cost=0.15..65.68 rows=1230 width=36)
(1 row)

Now, whether the planner will prefer this over seqscan-and-sort is a
much harder question.  Full-table index scans tend to require a lot
of random I/O so a sort is frequently seen as cheaper.

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] function indexes, index only scan and sorting

2014-12-12 Thread Jonathan Vanasco

On Dec 12, 2014, at 4:58 PM, Tom Lane wrote:

> regression=# create table tt (f1 int, f2 text);
> CREATE TABLE
> regression=# create index on tt (lower(f2));
> CREATE INDEX
> regression=# explain select * from tt order by lower(f2);
> QUERY PLAN 
> ----------------------------------------------------------------------------
> Index Scan using tt_lower_idx on tt  (cost=0.15..65.68 rows=1230 width=36)
> (1 row)


Thank you so much for posting this test.

I got a seq scan on my local machine, so I checked the version... still running 
9.2.4.
I tried it on production (which is 9.3.x) and got the same result as you.

Looking at the 9.3 release notes, I'm guessing this behavior is from one of the 
Optimizer fixes.



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] function indexes, index only scan and sorting

2014-12-12 Thread Tom Lane
Jonathan Vanasco  writes:
> Thank you so much for posting this test.

> I got a seq scan on my local machine, so I checked the version... still 
> running 9.2.4.
> I tried it on production (which is 9.3.x) and got the same result as you.

Hmm, well, I get the same result from 9.2.9, as well as every branch back
to 8.4 (too lazy to check further back).  Perhaps your 9.2.4 installation
is using non-default cost parameters that discourage the planner from
choosing this plan?

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Re: [GENERAL] I did some testing of GIST/GIN vs BTree indexing…

2014-12-12 Thread Guyren Howe
On Dec 10, 2014, at 19:38 , Bruce Momjian  wrote:
> 
> Are you saying when you use a GIN index on a,b,c fields, you can do
> lookups on them independently, like 'c'?  I was not aware that works,
> but it might.  I know it doesn't work for traditional btree as the index
> is hierarchical.  You can look up things like a,c and it will skip over
> 'b', but doing 'c' alone doesn't make any sense for traditional btree.
> 
> It would be interesting if that was true, though, and something we
> should more clearly document.  Your testing is very useful here.

This page:

http://www.postgresql.org/docs/9.4/static/indexes-multicolumn.html

says:

A multicolumn GiST index can be used with query conditions that involve any 
subset of the index's columns. Conditions on additional columns restrict the 
entries returned by the index, but the condition on the first column is the 
most important one for determining how much of the index needs to be scanned. A 
GiST index will be relatively ineffective if its first column has only a few 
distinct values, even if there are many distinct values in additional columns.

A multicolumn GIN index can be used with query conditions that involve any 
subset of the index's columns. Unlike B-tree or GiST, index search 
effectiveness is the same regardless of which index column(s) the query 
conditions use.



This appears to imply greater (complete?) flexibility in using non-leading 
columns with GIST and GIN indexes, or am I misunderstanding something? This is 
the whole reason I’ve started investigating this — particularly given what it 
says about GIN.

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general