Re: [HACKERS] checkpointer continuous flushing

2016-01-09 Thread Amit Kapila
On Thu, Jan 7, 2016 at 4:21 PM, Andres Freund  wrote:
>
> On 2016-01-07 11:27:13 +0100, Fabien COELHO wrote:
> > I read your patch, and I know what I want to try in order to have a small
> > and simple fix. I must admit that I have not really understood in which
> > condition the checkpointer would decide to close a file, but that does not
> > mean that the potential issue should not be addressed.
>
> There's a trivial example: Consider three tablespaces and
> max_files_per_process = 2. The balancing can easily cause three files
> being flushed at the same time.
>

Won't the same thing occur without the patch in mdsync(), and can't
we handle it in the same way?  In particular, I am referring to the code below:

mdsync()
{
	..
	/*
	 * It is possible that the relation has been dropped or
	 * truncated since the fsync request was entered.
	 * Therefore, allow ENOENT, but only if we didn't fail
	 * already on this file.  This applies both for
	 * _mdfd_getseg() and for FileSync, since fd.c might have
	 * closed the file behind our back.
	 *
	 * XXX is there any point in allowing more than one retry?
	 * Don't see one at the moment, but easy to change the
	 * test here if so.
	 */
	if (!FILE_POSSIBLY_DELETED(errno) ||
		failures > 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not fsync file \"%s\": %m",
						path)));
	else
		ereport(DEBUG1,
				(errcode_for_file_access(),
				 errmsg("could not fsync file \"%s\" but retrying: %m",
						path)));
}




With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Speedup twophase transactions

2016-01-09 Thread Simon Riggs
On 9 January 2016 at 12:26, Stas Kelvich  wrote:


> I’ve updated the patch and wrote a description of the things that happen
> with 2PC state data at the beginning of the file. I think this patch is now
> well documented, but if somebody points me to places that probably require
> a more detailed description, I’m ready to extend that.
>

Hmm, I was just preparing this for commit.

Please have a look at my mild edits and extended comments.

-- 
Simon Riggs   http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


2pc_optimize.v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2016-01-09 Thread Andres Freund
On 2016-01-09 18:04:39 +0530, Amit Kapila wrote:
> On Thu, Jan 7, 2016 at 4:21 PM, Andres Freund  wrote:
> >
> > On 2016-01-07 11:27:13 +0100, Fabien COELHO wrote:
> > > I read your patch, and I know what I want to try in order to have a small
> > > and simple fix. I must admit that I have not really understood in which
> > > condition the checkpointer would decide to close a file, but that does not
> > > mean that the potential issue should not be addressed.
> >
> > There's a trivial example: Consider three tablespaces and
> > max_files_per_process = 2. The balancing can easily cause three files
> > being flushed at the same time.
> >
> 
> Won't the same thing occur without the patch in mdsync(), and can't
> we handle it in the same way?  In particular, I am referring to the code below:

I don't see how that corresponds - the problem is that the currently
proposed infrastructure keeps a kernel-level (or fd.c, in my version) fd
open in its 'pending flushes' struct. But since that isn't associated
with fd.c opening/closing files, that fd isn't very meaningful.


> mdsync()

That seems to address different issues.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2016-01-09 Thread Andres Freund
On 2016-01-07 21:17:32 +0100, Andres Freund wrote:
> On 2016-01-07 21:08:10 +0100, Fabien COELHO wrote:
> > Hmmm. What I understood is that the workloads that have some performance
> > regressions (regressions that I have *not* seen in the many tests I ran) are
> > not due to checkpointer IOs, but rather occur in settings where most of the
> > writes are done by backends or the bgwriter.
> 
> As far as I can see you've not run many tests where the hot/warm data
> set is larger than memory (the full machine's memory, not
> shared_buffers). That quite drastically alters the performance
> characteristics here, because you suddenly have lots of synchronous read
> IO thrown into the mix.
> 
> Whether it's bgwriter or not I've not fully been able to establish, but
> it's a working theory.

Hm. New theory: The current flush interface does the flushing inside
FlushBuffer()->smgrwrite()->mdwrite()->FileWrite()->FlushContextSchedule(). The
problem with that is that at that point we (need to) hold a content lock
on the buffer!
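
Schematically (a simplified sketch of the path being described, not verbatim
backend code; buf_content_lock stands in for the buffer's content LWLock, and
FlushContextSchedule() is from the patch under discussion):

LWLockAcquire(buf_content_lock, LW_SHARED);  /* taken by FlushBuffer()'s caller */
FlushBuffer(bufHdr, NULL);
    /* -> smgrwrite() -> mdwrite() -> FileWrite()
     *    -> FlushContextSchedule():  flush issued here, lock still held */
LWLockRelease(buf_content_lock);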

Especially on a system that's bottlenecked on IO that means we'll
frequently hold content locks for a noticeable amount of time, while
flushing blocks, without any need to.

Even if that's not the reason for the slowdowns I observed, I think this
fact gives further credence to the current "pending flushes" tracking
residing on the wrong level.


Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2016-01-09 Thread Stas Kelvich
Hi.

I’ve updated the patch and wrote a description of the things that happen
with 2PC state data at the beginning of the file. I think this patch is now
well documented, but if somebody points me to places that probably require
a more detailed description, I’m ready to extend that.



2pc_xlog.diff
Description: Binary data


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


> On 08 Jan 2016, at 19:29, Alvaro Herrera  wrote:
> 
> Simon Riggs wrote:
> 
>> I think we could do better still, but this looks like the easiest 80% and
>> actually removes code.
>> 
>> The lack of substantial comments on the patch is a problem though - the
>> details above should go in the patch. I'll have a go at reworking this for
>> you, this time.
> 
> Is someone submitting an updated patch here?
> 
> -- 
> Álvaro Herrera   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2016-01-09 Thread Amit Kapila
On Sat, Jan 9, 2016 at 6:08 PM, Andres Freund  wrote:
>
> On 2016-01-09 18:04:39 +0530, Amit Kapila wrote:
> > On Thu, Jan 7, 2016 at 4:21 PM, Andres Freund  wrote:
> > >
> > > On 2016-01-07 11:27:13 +0100, Fabien COELHO wrote:
> > > > I read your patch, and I know what I want to try in order to have a small
> > > > and simple fix. I must admit that I have not really understood in which
> > > > condition the checkpointer would decide to close a file, but that does not
> > > > mean that the potential issue should not be addressed.
> > >
> > > There's a trivial example: Consider three tablespaces and
> > > max_files_per_process = 2. The balancing can easily cause three files
> > > being flushed at the same time.
> > >
> >
> > Won't the same thing occur without the patch in mdsync(), and can't
> > we handle it in the same way?  In particular, I am referring to the code below:
>
> I don't see how that corresponds - the problem is that the currently
> proposed infrastructure keeps a kernel-level (or fd.c, in my version) fd
> open in its 'pending flushes' struct. But since that isn't associated
> with fd.c opening/closing files, that fd isn't very meaningful.
>

Okay, but I think that is why you are worried that it is possible
to issue sync_file_range() on a closed file. Is that right, or am I
missing something?



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] checkpointer continuous flushing

2016-01-09 Thread Andres Freund
On 2016-01-09 19:05:54 +0530, Amit Kapila wrote:
> Right, that won't be acceptable; however, I think with your latest
> proposal [1]

Sure, that'd address that problem.


> [...] think that idea will help to mitigate the problem of backend and
> bgwriter writes as well.  For that, can't we do it with the help of the
> existing infrastructure of *pendingOpsTable* and
> *CheckpointerShmem->requests[]*?  Since the flush requests are already
> remembered in those structures, we can use them to apply your idea and
> issue flush requests.

Hm, that might be possible. But that might have some bigger implications
- we can currently issue thousands of flush requests a second, without
much chance of merging. I'm not sure it's a good idea to overlay that
onto the lower-frequency pendingOpsTable. Backends having to issue
fsyncs because the pending fsync queue is full is darn expensive. In
contrast to that, a 'flush hint' request getting lost doesn't cost that
much.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2016-01-09 Thread Amit Kapila
On Sat, Jan 9, 2016 at 6:26 PM, Andres Freund  wrote:
>
> On 2016-01-09 18:24:01 +0530, Amit Kapila wrote:
> > Okay, but I think that is why you are worried that it is possible
> > to issue sync_file_range() on a closed file. Is that right, or am I
> > missing something?
>
> That's one potential issue. You can also fsync a different file, try to
> print an error message containing an unallocated filename (that's how I
> noticed the issue in the first place)...
>
> I don't think it's going to be acceptable to issue operations on more or
> less random fds, even if that operation is hopefully harmless.
>

Right, that won't be acceptable; however, I think with your latest
proposal [1] we might not need to solve this problem, or do we still
need to address it?  I think that idea will help to mitigate the problem of
backend and bgwriter writes as well.  For that, can't we do it with the
help of the existing infrastructure of *pendingOpsTable* and
*CheckpointerShmem->requests[]*?  Since the flush requests are already
remembered in those structures, we can use them to apply your idea and
issue flush requests.



[1]
"It seems better to track a fixed
number of outstanding 'block flushes', independent of the file. Whenever
the number of outstanding blocks is exceeded, sort that list, and flush
all outstanding flush requests after merging neighbouring flushes."
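
A rough sketch of that idea (hypothetical code, not taken from any posted
patch; issue_pending_flushes() is an invented helper that would sort the
array, merge neighbouring blocks, and issue the flush hints):

#define MAX_PENDING_FLUSHES 128

typedef struct PendingFlush
{
	RelFileNode rnode;			/* relation the dirty block belongs to */
	ForkNumber	forknum;
	BlockNumber blocknum;
} PendingFlush;

static PendingFlush pending[MAX_PENDING_FLUSHES];
static int	npending = 0;

static void
schedule_flush(RelFileNode rnode, ForkNumber forknum, BlockNumber blocknum)
{
	if (npending == MAX_PENDING_FLUSHES)
	{
		/* sort by (rnode, forknum, blocknum), merge neighbours, flush */
		issue_pending_flushes(pending, npending);
		npending = 0;
	}
	pending[npending].rnode = rnode;
	pending[npending].forknum = forknum;
	pending[npending].blocknum = blocknum;
	npending++;
}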

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] checkpointer continuous flushing

2016-01-09 Thread Andres Freund
On 2016-01-09 18:24:01 +0530, Amit Kapila wrote:
> Okay, but I think that is why you are worried that it is possible
> to issue sync_file_range() on a closed file. Is that right, or am I
> missing something?

That's one potential issue. You can also fsync a different file, try to
print an error message containing an unallocated filename (that's how I
noticed the issue in the first place)...

I don't think it's going to be acceptable to issue operations on more or
less random fds, even if that operation is hopefully harmless.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer continuous flushing

2016-01-09 Thread Fabien COELHO


Hello Andres,


> Hm. New theory: The current flush interface does the flushing inside
> FlushBuffer()->smgrwrite()->mdwrite()->FileWrite()->FlushContextSchedule(). The
> problem with that is that at that point we (need to) hold a content lock
> on the buffer!


You are worrying that FlushBuffer is holding a lock on a buffer when the
"sync_file_range" call is issued.


Although I agree that it is not that good, I would be surprised if that were
the explanation for a performance regression, because sync_file_range with
the chosen parameters is an async call: it "advises" the OS to send the file,
but it does not wait for the writes to be completed.


Moreover, for this issue to have a significant impact, another backend would
have to need that very buffer at just that moment. ISTM that the performance
regression you are arguing about is on random-IO-bound performance, that is,
a few hundred tps at best, on very large bases with very many buffers, so the
probability of such a collision is very small and would not explain a
significant regression.



> Especially on a system that's bottlenecked on IO that means we'll
> frequently hold content locks for a noticeable amount of time, while
> flushing blocks, without any need to.


I'm not that sure it is really noticeable, because sync_file_range does 
not wait for completion.



> Even if that's not the reason for the slowdowns I observed, I think this
> fact gives further credence to the current "pending flushes" tracking
> residing on the wrong level.


ISTM that I put the tracking at the level where the information is
available without having to recompute it several times, as the flush needs
to know the fd and offset. Doing it differently would mean more code and
translating buffer to file/offset several times, I think.


Also, maybe you could answer a question I had about the performance
regression you observed. I could not find the post where you gave the
detailed information about it, so that I could try reproducing it: what
are the exact settings and conditions (shared_buffers, pgbench scaling,
host memory, ...), what is the observed regression (tps? something else?),
and what is the responsiveness of the database under the regression (e.g.
the percentage of seconds with 0 tps, or something like that)?


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2016-01-09 Thread Simon Riggs
On 9 January 2016 at 12:26, Stas Kelvich  wrote:


> I’ve updated the patch and wrote a description of the things that happen
> with 2PC state data at the beginning of the file. I think this patch is now
> well documented, but if somebody points me to places that probably require
> a more detailed description, I’m ready to extend that.
>

Your comments say

  "In case of crash replay will move data from xlog to files, if
that hasn't happened before."

but I don't see that in the code. Can you show me where that happens?

-- 
Simon Riggs   http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] [COMMITTERS] pgsql: Blind attempt at a Cygwin fix

2016-01-09 Thread Andrew Dunstan



On 01/08/2016 11:17 AM, Tom Lane wrote:

> Alvaro Herrera  writes:
> 
> > Alvaro Herrera wrote:
> > 
> > > Obviously this wasn't the best idea ever.  Andrew suggests on IM to
> > > revert this on Cygwin to just do the "isatty" check as originally.
> > 
> > Here's a proposed patch.  Thoughts?
> 
> Ugly, but it will hold the fort until someone can debug the service
> code for Cygwin.





I downloaded the official Cygwin packages into a Cygwin instance and 
checked how they do things. As I rather expected, they do not use pg_ctl 
at all to install or run as a service. Rather, they use the standard 
Cygwin service utility cygrunsrv. This is all managed via a SYSV style 
init script.


So if anything I'd be inclined to disable all the service-related code 
in pg_ctl for Cygwin, and treat it just as we treat Unix.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2016-01-09 Thread Artur Zakirov

On 09.01.2016 05:38, Alvaro Herrera wrote:

> Artur Zakirov wrote:
> 
> > Now almost all dictionaries are loaded into PostgreSQL. But the da_dk
> > dictionary does not load. I see the following error:
> > 
> > ERROR: invalid regular expression: quantifier operand invalid
> > CONTEXT: line 439 of configuration file
> > "/home/artur/progs/pgsql/share/tsearch_data/da_dk.affix": "SFX 55 0 s
> > +GENITIV
> > 
> > If you open the affix file in an editor, you can see that the format of
> > affix 55 on line 439 is incorrect (screen1.png):
> > 
> > [ another email ]
> > 
> > I also had implemented a patch that fixes an error from the e-mail
> > http://www.postgresql.org/message-id/562e1073.8030...@postgrespro.ru
> > This patch just ignores that error.
> 
> I think it's a bad idea to just ignore these syntax errors.  This affix
> file is effectively corrupt, after all, so it seems a bad idea that we
> need to cope with it.  I think it would be better to raise the error
> normally and instruct the user to fix the file; obviously it's better if
> the upstream provider of the file fixes it.
> 
> Now, if there is proof somewhere that the file is correct, then the code
> must cope in some reasonable way.  But in any case I don't think this
> change is acceptable ... it can only cause pain, in the long run.

This error is raised for the Danish dictionary because of an erroneous entry
in the .affix file. I sent a bug report to the developer, and he has fixed
the bug. The corrected dictionary can be downloaded from the LibreOffice site.

I will undo the changes so that the error is raised. I will update the patch
soon.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: [HACKERS] pglogical - logical replication contrib module

2016-01-09 Thread Steve Singer

On 12/31/2015 06:34 PM, Petr Jelinek wrote:

> Hi,
> 
> I'd like to submit the replication solution which is based on the
> pglogical_output [1] module (which is obviously needed for this to
> compile).
> 
> The pglogical contrib module provides an extension which does
> master-slave logical replication based on logical decoding.
> 
> The basic documentation is in README.md; I didn't bother making sgml
> docs yet since I expect that there will be ongoing changes and it's
> easier for me to update the markdown docs than sgml. I will do the
> conversion once we start approaching a committable state.

I am going to send my comments/issues out in batches as I find them
instead of waiting till I look over everything.



I find this part of the documentation a bit unclear


+Once the provider node is setup, subscribers can be subscribed to it. 
First the

+subscriber node must be created:
+
+SELECT pglogical.create_node(
+node_name := 'subscriber1',
+dsn := 'host=thishost port=5432 dbname=db'
+);
+

My initial reading was that I should execute this on the provider node.  
Perhaps instead

-
Once the provider node is setup you can then create subscriber nodes.  
Create the subscriber nodes and

then execute the following commands on each subscriber node

create extension pglogical

select pglogical.create_node(node_name:='subsriberX',dsn:='host=thishost 
dbname=db port=5432');


---

Also the documentation for create_subscription talks about

+  - `synchronize_structure` - specifies if to synchronize structure from
+provider to the subscriber, default true



I did the following

test2=# select pglogical.create_subscription(subscription_name:='default sub',
            provider_dsn:='host=localhost dbname=test1 port=5436');
 create_subscription
---------------------
           247109879


Which then resulted in the following showing up in my  PG log

LOG:  worker process: pglogical apply 16542:247109879 (PID 4079) exited 
with exit code 1
ERROR:  replication slot name "pgl_test2_test1_default sub" contains 
invalid character
HINT:  Replication slot names may only contain lower case letters, 
numbers, and the underscore character.
FATAL:  could not send replication command "CREATE_REPLICATION_SLOT 
"pgl_test2_test1_default sub" LOGICAL pglogical_output": status 
PGRES_FATAL_ERROR: ERROR:  replication slot name 
"pgl_test2_test1_default sub" contains invalid character
HINT:  Replication slot names may only contain lower case letters, 
numbers, and the underscore character.



The create_subscription command should check that the subscription name is
valid (i.e. meets the rules that will later be applied to the slot name).
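
A minimal sketch of the kind of up-front check that could be added
(hypothetical code, not pglogical's actual implementation;
check_subscription_name is an invented helper name):

static void
check_subscription_name(const char *name)
{
	const char *cp;

	for (cp = name; *cp; cp++)
	{
		/* same character set the server later enforces for slot names */
		if (!((*cp >= 'a' && *cp <= 'z') ||
			  (*cp >= '0' && *cp <= '9') ||
			  *cp == '_'))
			ereport(ERROR,
					(errcode(ERRCODE_INVALID_NAME),
					 errmsg("subscription name \"%s\" contains invalid character",
							name),
					 errhint("Subscription names may only contain lower case "
							 "letters, numbers, and the underscore character.")));
	}
}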


I wondered how I could fix my mistake.

The docs say

+- `pglogical.pglogical_drop_subscription(subscription_name name, 
ifexists bool)`

+  Disconnects the subscription and removes it from the catalog.
+

test2=# select pglogical.pglogical_drop_subscription('default sub', true);
ERROR:  function pglogical.pglogical_drop_subscription(unknown, boolean) 
does not exist



The command is actually called pglogical.drop_subscription; the docs should
be fixed to show the actual command name.



I then wanted to add a second table to my database. ('b').

select pglogical.replication_set_add_table('default','public.b',true);
 replication_set_add_table
---
 t
(1 row)

In my pglog I then got

LOG:  starting sync of table public.b for subscriber defaultsub
ERROR:  replication slot name "pgl_test2_test1_defaultsub_public.b" 
contains invalid character
HINT:  Replication slot names may only contain lower case letters, 
numbers, and the underscore character.
FATAL:  could not send replication command "CREATE_REPLICATION_SLOT 
"pgl_test2_test1_defaultsub_public.b" LOGICAL pglogical_output": status 
PGRES_FATAL_ERROR: ERROR:  replication slot name 
"pgl_test2_test1_defaultsub_public.b" contains invalid character
HINT:  Replication slot names may only contain lower case letters, 
numbers, and the underscore character.



I then did

test1=# select pglogical.replication_set_remove_table('default','public.b');
 replication_set_remove_table
--
 t
(1 row)


but my log still kept repeating the error, so I tried connecting to the
replica and doing the same there.


test2=# select pglogical.replication_set_remove_table('default','public.b');
ERROR:  replication set mapping -303842815:16726 not found

Is there any way to recover from this situation?

The documentation says I can drop a replication set; maybe that will let
replication continue.


+- `pglogical.delete_replication_set(set_name text)`
+  Removes the replication set.
+

select pglogical.delete_replication_set('default');
ERROR:  function pglogical.delete_replication_set(unknown) does not exist
LINE 1: select pglogical.delete_replication_set('default');
   ^
HINT:  No function matches the given name and argument types. You might
need to add explicit type casts.

Re: [HACKERS] PATCH: Extending the HyperLogLog API a bit

2016-01-09 Thread Tomas Vondra



On 01/01/2016 12:08 AM, Peter Geoghegan wrote:

> On Thu, Dec 31, 2015 at 12:48 PM, Tomas Vondra
>  wrote:
> 
> > 1) initHyperLogLogError(hyperLogLogState *cState, double error)
> > 
> > Instead of specifying bwidth (essentially the number of bits used
> > for addressing in the counter), this allows specifying the expected
> > error rate for the counter, which is
> > 
> >    error_rate = 1.04 / sqrt(2^bwidth)
> > 
> > So for 5% we get bwidth=9, and so on. This makes the API a bit easier
> > to use, because there are pretty much no comments about the meaning
> > of bwidth, and the existing callers simply use 10 without details.


> Fair, but you didn't add any better comments!


> > 2) freeHyperLogLog(hyperLogLogState *cState)
> > 
> > I think it's a good idea to provide a function "undoing" what init
> > does, i.e. freeing the internal memory etc. Currently that's trivial
> > to do, but perhaps we'll make the structure more complicated in the
> > future (albeit that might be unlikely).


> Seems reasonable.


Attached is v2 of the patch, adding the comments.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 6cf109c466b60ecf62c27ce04757da51a9726729 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Sat, 9 Jan 2016 20:45:03 +0100
Subject: [PATCH] extend HyperLogLog API

Extends the HyperLogLog API with two new functions. One allows
specifying desired error rate when initializing the counter,
the other one frees the internal counter state.
---
 src/backend/lib/hyperloglog.c | 45 +++
 src/include/lib/hyperloglog.h |  2 ++
 2 files changed, 47 insertions(+)

diff --git a/src/backend/lib/hyperloglog.c b/src/backend/lib/hyperloglog.c
index 094bc09..2dca070 100644
--- a/src/backend/lib/hyperloglog.c
+++ b/src/backend/lib/hyperloglog.c
@@ -108,6 +108,51 @@ initHyperLogLog(hyperLogLogState *cState, uint8 bwidth)
 }
 
 /*
+ * Initialize HyperLogLog track state
+ *
+ * Instead of specifying bwidth (number of bits used for addressing the
+ * register), this method allows sizing the counter for particular error
+ * rate using a simple formula from the paper:
+ *
+ *   e = 1.04 / sqrt(m)
+ *
+ * where 'm' is the number of registers, i.e. (2^bwidth). The method
+ * finds the lowest bwidth with 'e' below the requested error rate, and
+ * then uses it to initialize the counter.
+ *
+ * As bwidth has to be between 4 and 16, the worst possible error rate
+ * is ~25% (bwidth=4) and the best is ~0.4% (bwidth=16).
+ */
+void
+initHyperLogLogError(hyperLogLogState *cState, double error)
+{
+	uint8 bwidth = 4;
+
+	while (bwidth < 16)
+	{
+		double m = (Size)1 << bwidth;
+		if (1.04 / sqrt(m) < error)
+			break;
+		bwidth++;
+	}
+
+	initHyperLogLog(cState, bwidth);
+}
+
+/*
+ * Free HyperLogLog track state
+ *
+ * Releases the internal state, but not the state itself (in case it's
+ * not allocated by palloc).
+ */
+void
+freeHyperLogLog(hyperLogLogState *cState)
+{
+	Assert(cState->hashesArr != NULL);
+	pfree(cState->hashesArr);
+}
+
+/*
  * Adds element to the estimator, from caller-supplied hash.
  *
  * It is critical that the hash value passed be an actual hash value, typically
diff --git a/src/include/lib/hyperloglog.h b/src/include/lib/hyperloglog.h
index 5a1d4d3..b999b30 100644
--- a/src/include/lib/hyperloglog.h
+++ b/src/include/lib/hyperloglog.h
@@ -60,8 +60,10 @@ typedef struct hyperLogLogState
 } hyperLogLogState;
 
 extern void initHyperLogLog(hyperLogLogState *cState, uint8 bwidth);
+extern void initHyperLogLogError(hyperLogLogState *cState, double error);
 extern void addHyperLogLog(hyperLogLogState *cState, uint32 hash);
 extern double estimateHyperLogLog(hyperLogLogState *cState);
 extern void mergeHyperLogLog(hyperLogLogState *cState, const hyperLogLogState *oState);
+extern void freeHyperLogLog(hyperLogLogState *cState);
 
 #endif   /* HYPERLOGLOG_H */
-- 
2.1.0
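
For reference, a minimal usage sketch of the proposed API (a hypothetical
caller, not part of the patch; key/keylen are placeholders, and hash_any()
is the existing backend hashing routine):

	hyperLogLogState	cState;
	double				estimate;

	/* size the counter for roughly 1% error: 1.04/sqrt(2^14) is about 0.8% */
	initHyperLogLogError(&cState, 0.01);

	/* feed in hashes of the values whose cardinality we want to estimate */
	addHyperLogLog(&cState,
				   DatumGetUInt32(hash_any((unsigned char *) key, keylen)));

	estimate = estimateHyperLogLog(&cState);

	/* release the internal hashesArr allocation */
	freeHyperLogLog(&cState);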


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2016-01-09 Thread Artur Zakirov

Thanks for review.

On 09.01.2016 02:04, Alvaro Herrera wrote:

> Artur Zakirov wrote:
> 
> > --- 74,80 ----
> >   
> >   typedef struct aff_struct
> >   {
> > ! 	uint32		flag:16,
> > 				type:1,
> > 				flagflags:7,
> > 				issimple:1,
> 
> By doing this, you're using 40 bits of a 32-bits-wide field.  What does
> this mean?  Are the final 8 bits lost?  Does the compiler allocate a
> second uint32 member for those additional bits?  I don't know, but I
> don't think this is a very clean idea.
No, the 8 bits are not lost. These 8 bits are used if a dictionary uses the
double extended ASCII character flag type (Conf->flagMode == FM_LONG) or the
decimal number flag type (Conf->flagMode == FM_NUM). If a dictionary uses the
single extended ASCII character flag type (Conf->flagMode == FM_CHAR), then
8 bits are lost.


You can see it in the decodeFlag function. This function is used in the
NIImportOOAffixes function; it decodes the affix flag from its string form
and stores it in the flag field (flag:16).



> > 	typedef struct spell_struct
> > 	{
> > !	struct
> > 	{
> > !		int			affix;
> > !		int			len;
> > !	}			d;
> > !	/*
> > !	 * flag is filled in by NIImportDictionary. After NISortDictionary, d
> > !	 * is valid and flag is invalid.
> > !	 */
> > !	char	   *flag;
> > 	char		word[FLEXIBLE_ARRAY_MEMBER];
> > 	} SPELL;
> 
> Here you removed the union, with no rationale for doing so.  Why did you
> do it?  Can it be avoided?  Because of the comment, I'd imagine that d
> and flag are valid at different times, so at any time we care about only
> one of them; but you haven't updated the comment stating the reason for
> that no longer to be the case.  I suspect you need to keep flag valid
> after NISortDictionary has been called, but if so why?  If "flag" is
> invalid as the comment says, what's the reason for keeping it?
The union was removed because the flag field needs to be dynamically sized.
It had size 16 in the previous version. The flag set is stored in this field.
For example, if the .dict file has the entry:

mitigate/NDSGny

then "NDSGny" is stored in the flag field.

But in some cases a flag set can be bigger than 16 characters. I added this
change after this message:
http://www.postgresql.org/message-id/CAE2gYzwom3=11u9g8zxmt5plkzrwb12bwzxh4db3hud89fo...@mail.gmail.com

In that Turkish dictionary there can be large flag sets. For example:

özek/2240,852,749,5026,2242,4455,2594,2597,4963,1608,494,2409

This flag set has size 56.

This "flag" is valid all the time. It is used in NISortDictionary and is not
used after the NISortDictionary function has been called. Maybe you are right
that there is no reason for keeping it, and it would be better to store all
flags in a separate variable that is deleted after NISortDictionary has been
called.



> The routines decodeFlag and isAffixFlagInUse could do with more
> comments.  Your patch adds zero.  Actually the whole file has not nearly
> enough comments; adding some more would be very good.
> 
> Actually, after some more reading, I think this code is pretty terrible.
> I have a hard time figuring out how the original works, which makes it
> even more difficult to figure out whether your changes make sense.  I
> would have to take your patch on faith, which doesn't sound so great an
> idea.
> 
> palloc / cpalloc / tmpalloc make the whole mess even more confusing.
> Why does this file have three ways to allocate memory?
> 
> Not sure what's a good way to go about this.  I am certainly not going
> to commit this as is, because if I do whatever bugs you have are going
> to become my problem; and with the severe lack of documentation and
> given how fiddly this stuff is, I bet there are going to be a bunch of
> bugs.  I suspect most committers are going to be in the same position.
> I think you should start by adding a few comments here and there on top
> of the original to explain how it works, then your patch on top.  I
> suppose it's going to be a lot of work for you but I don't see any other
> way.
> 
> A top-level overview about it would be good, too.  The current comment
> at top of file states:
> 
>   * spell.c
>   * Normalizing word with ISpell
> 
> which is, err, somewhat laconic.

I will add comments and explain how it works in them. Maybe I will also add
some explanation of the dictionary structure. I will update the patch soon.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: [HACKERS] [PATCH] Add STRICT to some regression test C functions.

2016-01-09 Thread Tom Lane
Andreas Seltenreich  writes:
> some of the C functions in the regression test DB readily crash when
> passing NULL input values.  The regression tests themselves do not pass
> NULL values to them, but when the regression database is used as a basis
> for fuzz testing, they cause a lot of noise.  Maybe someone can sneak
> this patch in?

Pushed, thanks.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Blind attempt at a Cygwin fix

2016-01-09 Thread Michael Paquier
On Sun, Jan 10, 2016 at 2:00 AM, Andrew Dunstan  wrote:
> I downloaded the official Cygwin packages into a Cygwin instance and checked
> how they do things. As I rather expected, they do not use pg_ctl at all to
> install or run as a service. Rather, they use the standard Cygwin service
> utility cygrunsrv. This is all managed via a SYSV style init script.

Thanks for the investigation!

> So if anything I'd be inclined to disable all the service-related code in
> pg_ctl for Cygwin, and treat it just as we treat Unix.

We had better do the same for the back branches then. Do we need a patch?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: bloom filter in Hash Joins with batches

2016-01-09 Thread Peter Geoghegan
On Sat, Jan 9, 2016 at 11:02 AM, Tomas Vondra
 wrote:
> So, this seems to bring reasonable speedup, as long as the selectivity is
> below 50%, and the data set is sufficiently large.

What about semijoins? Apparently they can use bloom filters
particularly effectively. Have you considered them as a special case?

Also, have you considered Hash join conditions with multiple
attributes as a special case? I'm thinking of cases like this:

regression=# set enable_mergejoin = off;
SET
regression=# explain analyze select * from tenk1 o join tenk2 t on
o.twenty = t.twenty and t.hundred = o.hundred;
   QUERY PLAN
──
 Hash Join  (cost=595.00..4103.00 rows=5 width=488) (actual
time=12.086..1026.194 rows=100 loops=1)
   Hash Cond: ((o.twenty = t.twenty) AND (o.hundred = t.hundred))
   ->  Seq Scan on tenk1 o  (cost=0.00..458.00 rows=1 width=244)
(actual time=0.017..4.212 rows=1 loops=1)
   ->  Hash  (cost=445.00..445.00 rows=1 width=244) (actual
time=12.023..12.023 rows=1 loops=1)
 Buckets: 16384  Batches: 1  Memory Usage: 2824kB
 ->  Seq Scan on tenk2 t  (cost=0.00..445.00 rows=1
width=244) (actual time=0.006..3.453 rows=1 loops=1)
 Planning time: 0.567 ms
 Execution time: 1116.094 ms
(8 rows)

(Note that while the optimizer has a slight preference for a merge
join in this case, the plan I show here is a bit faster on my
machine).


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: bloom filter in Hash Joins with batches

2016-01-09 Thread Peter Geoghegan
On Sat, Jan 9, 2016 at 4:08 PM, Peter Geoghegan  wrote:
> Also, have you considered Hash join conditions with multiple
> attributes as a special case? I'm thinking of cases like this:

Sorry, accidentally fat-fingered my enter key before I was finished
drafting that mail. That example isn't useful, because there is no
early/cheap elimination of tuples while scanning the outer relation.

My point about bloom filters on only one attribute (or perhaps
multiple bloom filters on multiple attributes) is that you might be
able to make the bloom filter more effective, particularly relative to
the amount of memory available, by only building it for one attribute
where that distinction (one attribute vs multiple) happens to exist.

Simon said: "So the objective must be to get a Bloom Filter that is
small enough that it lives in a higher/faster level of cache than the
main Hash table". I agree that that's really important here, although
I did note that Tomas wasn't so sure, emphasizing the importance of
avoiding serialization and deserialization overhead -- and certainly,
that's why the patch helps some cases a lot. But Simon's point applies
more to the worst case than the best or average cases, and evidently
we have plenty of worrying about the worst case left to do here.

So, one attribute could have relatively low cardinality, and thus
fewer distinct items in the bloom filter's set, making it far slower
to degrade (i.e. have increased probability of false positives) as
compared to a composite of two or more attributes, and yet perhaps not
significantly less useful in terms of its ability to eliminate
(de)serialization overhead. Or, with two bloom filters (one per
attribute), one bloom filter could degrade far faster than the other
(due to having more distinct values), often very unpredictably, and
yet it wouldn't matter because we'd discard the bad one (maybe really
early, during the scan of the inner relation, using HLL).

I notice that all these example queries involve less-than operators
(inner relation tuples are thereby prevented from being entered into
the hash table). It might not be a terrible idea to hash abbreviated
keys (or even truncated abbreviated keys) for the bloom filter in
certain cases, in particular when there is a correlation in the inner
relation attributes (equi-joined attribute, and attribute that is
other part of qual). There might not be a correlation, of course, but
with some care hashing abbreviated keys may have virtually no
additional overhead, making it worth doing despite the general
uncertainty about any correlation existing. If you're going to accept
that the bloom filter can't be very large, which I think you must,
then hashing an entire value may be no better than hashing an
abbreviated key (those 8 bytes tend to have a lot of entropy). This
point is rather hand-wavy, though.

I suspect that BLOOM_ERROR_RATE isn't the most useful constraint here.
Is the degree to which it helps in sympathetic cases at all
commensurate with the degree to which it hurts in unsympathetic cases?
In your "low selectivity" cases (the sympathetic cases), there are
relatively few values represented in the main hash table, and
therefore in the bloom filter, and so a smaller bloom filter results
automatically anyway. But in the bad cases, the BLOOM_ERROR_RATE
constraint just wastes memory bandwidth for no gain, I think. If the
number of elements is so large that a reasonable fixed sized bloom
filter has a poor false positive rate anyway, we may have already
lost.

My sense is that we should try to make the bloom filter as cheap as
possible more so than trying to make sure it's useful before much work
has been done.

Is it okay that you don't treat MCV/skew values as special? Actually,
looking at this code, I think I notice something:

@@ -238,6 +238,15 @@ ExecHashJoin(HashJoinState *node)
 hashvalue);
node->hj_CurTuple = NULL;

+   /* If still in the first batch, we check the bloom filter. */
+   if ((hashtable->curbatch == 0) &&
+   (! ExecHashBloomCheckValue(hashtable, hashvalue)))
+   {
+   /* no matches; check for possible outer-join fill */
+   node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
+   continue;
+   }

Haven't we already established by now that the value of
"node->hj_CurSkewBucketNo" may not be INVALID_SKEW_BUCKET_NO, and so
the value certainly was found in the inner relation scan? Why bother
doing anything with the bloom filter in that common case? (Note that
I'm not suggesting that we don't establish "node->hj_CurSkewBucketNo"
first).
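
A sketch of the kind of reordering being suggested (hypothetical, not taken
from the patch): only consult the bloom filter when the skew optimization
did not already match the value:

+   /* Only bother with the bloom filter in batch 0, and only when the
+* skew hash table didn't already match this hash value. */
+   if ((hashtable->curbatch == 0) &&
+   (node->hj_CurSkewBucketNo == INVALID_SKEW_BUCKET_NO) &&
+   (! ExecHashBloomCheckValue(hashtable, hashvalue)))
+   {
+   /* no matches; check for possible outer-join fill */
+   node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
+   continue;
+   }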

Also, are you aware of this?

http://www.nus.edu.sg/nurop/2010/Proceedings/SoC/NUROP_Congress_Cheng%20Bin.pdf

It talks about bloom filters for hash joins in PostgreSQL
specifically. Interestingly, they talk about specific TPC-H queries.

BTW, I 

Re: [HACKERS] WIP: bloom filter in Hash Joins with batches

2016-01-09 Thread Peter Geoghegan
On Sat, Jan 9, 2016 at 11:02 AM, Tomas Vondra
 wrote:
> Which means the "dim.r" column has 100 different values (0-99) with uniform
> distribution. So e.g. "WHERE r < 15" matches 15%.

I think that the use of a uniform distribution to demonstrate this
patch is a bad idea, unless you want to have a conversation about the
worst case.

Look at the use cases for bloom filters in general. They're almost all
some variation on the same theme: checking a bloom filter
inexpensively, to avoid an expensive cache miss (of some fashion). The
Google Chrome web browser uses a bloom filter for malicious URLs. It
usually avoids consulting Google's servers about whether or not any
given URL that is visited is malicious by using the bloom filter. This
is effective in large part because the top 100 websites ranked by
popularity have a huge falloff in popularity as you go down the list.
It looks like a Zipfian distribution, and so I imagine they get pretty
far with a very small bloom filter, even though in theory the chances
of any given valid URL resulting in a cache miss is very high.

Generally, uniform distributions are rare in a non-canonical list of
things, like a hash join outer relation's attribute.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] ExecGather() + nworkers

2016-01-09 Thread Peter Geoghegan
The Gather node executor function ExecGather() does this:

/*
 * Register backend workers. We might not get as many as we
 * requested, or indeed any at all.
 */
pcxt = node->pei->pcxt;
LaunchParallelWorkers(pcxt);

/* Set up tuple queue readers to read the results. */
if (pcxt->nworkers > 0)
{
...
}

/* No workers?  Then never mind. */
if (!got_any_worker)
ExecShutdownGatherWorkers(node);

I'm not sure why the test for nworkers following the
LaunchParallelWorkers() call doesn't look like this, though:

/* Set up tuple queue readers to read the results. */
if (pcxt->nworkers_launched > 0)
{
...
}

ISTM, having followed the chain of calls, that the "if" statement I
highlight here is currently tautological (excluding the possibility of
one or two special cases in the CreateParallelContext() call performed
by ExecInitParallelPlan(); e.g., the IsolationIsSerializable() case
*can* result in the caller's nworkers being overridden to 0).

I guess it isn't surprising that the code evolved to look like this,
since the commit message of b0b0d84b ponders: "I suspect we'll want to
revise the Gather node to make use of this new capability [relaunching
workers], but even if not it may be useful elsewhere and requires very
little additional code". The nworkers_launched tracking came only in
this later commit.

From my point of view, as a student of this code working on parallel
index builds (i.e new code which will also be a client of parallel.c,
a module which right now only has nodeGather.c as a client to learn
from), this is confusing. It's confusing just because the simpler
approach isn't taken when it could have been. It isn't actually wrong
that we figure out if any workers were launched in this relatively
laborious way:

/* Set up tuple queue readers to read the results. */
if (pcxt->nworkers > 0)
{
...

for (i = 0; i < pcxt->nworkers; ++i)
{
if (pcxt->worker[i].bgwhandle == NULL)
continue;

...

nworkers_launched = true;
...

But going to this additional trouble (detecting no workers launched on
the basis of !nworkers_launched) suggests that simply testing
nworkers_launched would be wrong, which AFAICT it isn't. Can't we just
do that, and in so doing also totally remove the "for" loop shown
here?

In the case of parallel sequential scan, it looks like one worker can
be helpful, because then the gather node (leader process) can run the
plan itself to some degree, and so there are effectively 2 processes
scanning at a minimum (unless 0 workers could be allocated to begin
with). How useful is it to have a parallel scan when this happens,
though?

I guess it isn't obvious to me how to reliably back out of not being
able to launch at least 2 workers in the case of my parallel index
build patch, because I suspect 2 workers (plus the leader process) are
the minimum number that will make index builds faster. Right now, it
looks like I'll have to check nworkers_launched in the leader (which
will be the only process to access the ParallelContext, since it's in
its local memory). Then, having established that there are at least
the minimum useful number of worker processes launched for sorting,
the leader can "permit" worker processes to "really" start based on
changing some state in the TOC/segment in common use. Otherwise, the
leader must call the whole thing off and do a conventional, serial
index build, even though technically the main worker process function
has started execution in worker processes.

I think what might be better is a general solution to my problem,
which I imagine will crop up again and again as new clients are added.
I would like an API that lets callers of LaunchParallelWorkers() only
actually launch *any* worker on the basis of having been able to
launch some minimum sensible number (typically 2). Otherwise, indicate
failure, allowing callers to call the whole thing off in a general
way, without the postmaster having actually launched anything, and
without custom "call it all off" code for parallel index builds. This
would probably involve introducing a distinction between a
BackgroundWorkerSlot being "reserved" rather than "in_use", lest the
postmaster accidentally launch 1 worker process before we established
definitively that launching any is really a good idea.
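
To illustrate the shape such an API could take (purely hypothetical; the
function name and signature are invented here, not existing parallel.c code):

/*
 * Launch workers only if at least min_workers slots can be reserved;
 * otherwise launch none and return false, so the caller can fall back
 * to a fully serial plan without any custom cleanup.
 */
extern bool LaunchParallelWorkersMin(ParallelContext *pcxt, int min_workers);

/* caller, e.g. a parallel index build */
if (!LaunchParallelWorkersMin(pcxt, 2))
{
	/* no workers were started at all; do a conventional serial build */
}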

Does that make sense?

Thanks
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Spelling corrections

2016-01-09 Thread Peter Geoghegan
Attached patch fixes a couple of misspellings.

-- 
Peter Geoghegan
From 1c6149c0ed347d7514104a11bc3b47e1a9a5724a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Sat, 9 Jan 2016 22:19:49 -0800
Subject: [PATCH] Fix a couple of misspellings

---
 src/backend/executor/nodeGather.c | 2 +-
 src/bin/pg_upgrade/controldata.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 046f156..16c981b 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -138,7 +138,7 @@ ExecGather(GatherState *node)
 	/*
 	 * Initialize the parallel context and workers on first execution. We do
 	 * this on first execution rather than during node initialization, as it
-	 * needs to allocate large dynamic segement, so it is better to do if it
+	 * needs to allocate large dynamic segment, so it is better to do if it
 	 * is really needed.
 	 */
 	if (!node->initialized)
diff --git a/src/bin/pg_upgrade/controldata.c b/src/bin/pg_upgrade/controldata.c
index 1f7b65e..2def729 100644
--- a/src/bin/pg_upgrade/controldata.c
+++ b/src/bin/pg_upgrade/controldata.c
@@ -539,7 +539,7 @@ check_control_data(ControlData *oldctrl,
 		pg_fatal("old and new pg_controldata block sizes are invalid or do not match\n");
 
 	if (oldctrl->largesz == 0 || oldctrl->largesz != newctrl->largesz)
-		pg_fatal("old and new pg_controldata maximum relation segement sizes are invalid or do not match\n");
+		pg_fatal("old and new pg_controldata maximum relation segment sizes are invalid or do not match\n");
 
 	if (oldctrl->walsz == 0 || oldctrl->walsz != newctrl->walsz)
 		pg_fatal("old and new pg_controldata WAL block sizes are invalid or do not match\n");
-- 
1.9.1


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers