Re: I'd like to discuss scaleout at PGCon

2020-09-14 Thread MauMau
Hello all,

# I'm resending because some error occurred

I've overhauled the scaleout design wiki I presented at PGCon 2018
developer unconference and assembled the research of other DBMSs'
scale-out features.

Scaleout Design
https://wiki.postgresql.org/wiki/Scaleout_Design

I have intentionally drawn few conclusions about the specification and
design.  I'd like you to look at recent distributed databases, and
then think about and discuss together what we want to aim for.  I feel
it's better to have a separate thread per topic or group of topics.

I'm sorry, but I'm not confident about the readability at all, because
I cannot draw figures due to my visual impairment, and the page is
full of text only.

What workload do you think we should focus on first, OLTP or
analytics?  I think OLTP, because open source Postgres has probably
been gaining popularity mostly for OLTP so far.  Also, I expect many
people will use existing popular SaaS for data warehousing, like
Amazon Redshift, Azure Synapse, Google BigQuery, and Snowflake, rather
than build their own analytics databases on public IaaS or
on-premises.


Regards
MauMau






Re: Protect syscache from bloating with negative cache entries

2019-02-08 Thread MauMau
From: Alvaro Herrera
> I think seqscanning the hash table is going to be too slow; Ideriha-san's
> idea of having a dlist with the entries in LRU order (where each entry
> is moved to head of list when it is touched) seemed good: it allows you
> to evict older ones when the time comes, without having to scan the rest
> of the entries.  Having a dlist means two more pointers on each cache
> entry AFAIR, so it's not a huge amount of memory.

Absolutely.  We should try to avoid unpredictably long response times
caused by an occasional unlucky batch of cleanup work.  That makes
troubleshooting difficult when the user asks why they experience
unsteady response times.

Regards
MauMau








Re: Protect syscache from bloating with negative cache entries

2019-02-08 Thread MauMau
From: Tomas Vondra
> I don't think we need to remove the expired entries right away, if there
> are only very few of them. The cleanup requires walking the hash table,
> which means significant fixed cost. So if there are only few expired
> entries (say, less than 25% of the cache), we can just leave them around
> and clean them if we happen to stumble on them (although that may not be
> possible with dynahash, which has no concept of expiration) or before
> enlarging the hash table.

I agree that we don't need to evict cache entries as long as memory
permits (within the DBA's control).

But how does the concept of expiration fit the catcache?  How would
the user determine the expiration time, i.e. the setting of
syscache_prune_min_age?  If you set a small value to evict unnecessary
entries faster, necessary entries will also be evicted.  Some access
counter could keep frequently accessed entries longer, but an idle
period (e.g. a lunch break) can flush entries that you will want to
access again afterwards.

The idea of expiration applies to cases where we want possibly stale
entries to vanish and newer data to be loaded upon the next access,
for example the TTL (time-to-live) of Memcached, Redis, DNS, and ARP.
Is the catcache based on the same idea?  No.

What we want to do is evict never- or infrequently-used cache entries.
That's naturally the task of LRU, isn't it?  Even the high-performance
Memcached and Redis use LRU when the cache is full.  As Bruce said, we
don't have to worry about lock contention or the like, because we're
talking about a backend-local cache.  Are we worried about the
overhead of manipulating the LRU chain?  The current catcache already
does it on every access: it calls dlist_move_head() to move the
accessed entry to the front of its hash bucket.
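To make the LRU discussion concrete, here is a minimal illustrative sketch in Python (a toy model, not the backend's C catcache; the table names and capacity are made up).  Accessing an entry moves it to the most-recently-used end, analogous to dlist_move_head(), and eviction pops the least-recently-used end:

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU cache: touched entries move to the MRU end, and
    eviction removes the LRU end, as a dlist-based LRU would."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # ordered from LRU to MRU

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # analogous to dlist_move_head()
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("pg_class", 1)
cache.put("pg_attribute", 2)
cache.get("pg_class")      # touch: pg_class becomes most recent
cache.put("pg_proc", 3)    # evicts pg_attribute, the LRU entry
print(sorted(cache.entries))  # ['pg_class', 'pg_proc']
```

Since the chain is manipulated with constant-time pointer (here, dict order) updates, the per-access overhead is comparable to what the catcache already pays for its hash-bucket move.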


> So if we want to address this case too (and we probably want), we may
> need to discard the old cache memory context somehow (e.g. rebuild the
> cache in a new one, and copy the non-expired entries). Which is a nice
> opportunity to do the "full" cleanup, of course.

The straightforward, natural, and familiar way is to limit the cache
size, which I mentioned in a previous mail.  We should give the DBA
the ability to control memory usage, rather than thinking about what
to do after letting the memory area grow unnecessarily large.  That's
what a typical "cache" is, isn't it?

https://en.wikipedia.org/wiki/Cache_(computing)

"To be cost-effective and to enable efficient use of data, caches must
be relatively small."


Another relevant, if suboptimal, idea would be to give each catcache a
separate memory context that is a child of CacheMemoryContext.  This
allows a slight optimization by using the slab context (slab.c) for a
catcache with fixed-sized tuples.  But that'd be a bit complex for PG
12, I'm afraid.


Regards
MauMau





Re: [bug fix] ECPG: freeing memory for pgtypes crashes on Windows

2018-06-26 Thread MauMau
From: Thomas Munro
> Ok, back-patched.

Thank you very much!

> It seems like the other patch[1] might need the same treatment, right?

I believe so, because that patch addresses the same root cause.


Regards
MauMau




Re: [HACKERS] Statement-level rollback

2018-06-15 Thread MauMau
From: Alvaro Herrera
> I've been looking at re-implementing this feature recently, using
> Tsunakawa's proposed UI of a GUC transaction_rollback_scope that can
> take values "transaction" (default, current behavior) and "statement".
> I didn't take other parts of his patch though; see below.

Thank you so much for reviving this thread!


> I propose that we solve this by
> allowing this feature to be enabled only via one of:
>
> * a PGOPTIONS connection-time option
> * ALTER USER SET (transaction_rollback_scope)

Why don't we also allow ALTER DATABASE SET for a database exclusively
for data migrated from another DBMS?


> but it can be *disabled* normally via SET.  In other words, changing the
> scope from transaction to statement in a running session is forbidden,
> but changing it the other way around is allowed (if app is unsure
> whether env is unsafe, it can set the scope to "transaction" to ensure
> it's safe from that point onwards).  Changing the scope in
> postgresql.conf is forbidden, so a server is never unsafe as a whole.

Would it be dangerous to allow both enabling and disabling the
statement-level rollback only outside a transaction block?  I thought
it was considered dangerous to change the setting inside a transaction
block.


> Drivers such as JDBC can easily use this mode, for example a connection
> option such as "AUTOSAVE=SERVER" can automatically add the
> transaction_rollback_scope option.  (Naturally, if the server does not
> support transaction_rollback_scope and the user gave that option, this
> causes an exception to be raised -- NOT to fallback to the standard
> transaction behavior!)

How do the drivers know, from the server error response to connection
request, that transaction_rollback_scope is unsupported?


> Tsunakawa's implementation puts the feature in postgres.c's client loop.
> I think a better way to implement this is to change xact.c to have a new
> TBLOCK state which indicates when to start a new internal
> subtransaction; StartTransactionCommand pushes a new element into the
> transaction stack and puts it in the new state; a subsequent operation
> actually starts the new subtransaction.  (This design decision allows
> things like SAVEPOINT to work correctly by having the
> subtransaction-for-savepoint appear *before* the internal
> subtransaction, so a subsequent "SELECT 0/0" doesn't remove the user
> declared savepoint.)

That sounds interesting.

* How can PLs like PL/pgSQL utilize this to continue upon an SQL
failure?  They don't call StartTransactionCommand.

* How can psql make use of this feature for its ON_ERROR_ROLLBACK?


Regards
MauMau




Re: I'd like to discuss scaleout at PGCon

2018-06-06 Thread MauMau
From: Alvaro Herrera
> Maybe an easy (hah) thing to do is use 2PC for DDL, agree on an OID
> that's free on every node, then create the object in all servers at the
> same time.  We currently use the system-wide OID generator to assign the
> OID, but seems an easy thing to change (much harder is to prevent
> concurrent creation of objects using the arranged OID; maybe can reuse
> speculative tokens in btrees for this).

Would it mean that DDL execution requires all (primary) nodes to be
running?  I don't say it's an immediate problem yet, but FYI, Oracle
Sharding records all DDL statements and re-sends them to the down
nodes later.
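Alvaro's idea above can be sketched as a toy two-phase protocol: the coordinator proposes candidate OIDs until it finds one that every node votes is free, then commits the creation everywhere.  This is an illustrative Python model under my own assumptions (the Node class, create_everywhere, and the OID ranges are hypothetical), not a design for the real catalog code:

```python
class Node:
    """Toy model of one server's catalog state (hypothetical)."""

    def __init__(self, used_oids):
        self.used = set(used_oids)
        self.objects = {}

    def prepare(self, oid):
        # Phase 1: vote yes only if the proposed OID is free locally.
        return oid not in self.used

    def commit(self, oid, name):
        # Phase 2: actually create the object under the agreed OID.
        self.used.add(oid)
        self.objects[oid] = name

def create_everywhere(nodes, name, candidates):
    """Try candidate OIDs until every node votes the OID free,
    then commit the creation on all nodes."""
    for oid in candidates:
        if all(n.prepare(oid) for n in nodes):
            for n in nodes:
                n.commit(oid, name)
            return oid
    raise RuntimeError("no OID free on every node")

nodes = [Node({16384}), Node({16385})]
oid = create_everywhere(nodes, "my_table", range(16384, 16400))
print(oid)  # 16386: the first OID free on both nodes
```

The model also shows why all nodes must be reachable at DDL time: a down node cannot vote in phase 1, which is where Oracle Sharding's replay-later approach differs.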


> Doing this imposes a cost at DDL-execution-time only, which seems much
> better than imposing the cost of translating name to OID on every
> server for every query.

Agreed.

Regards
MauMau





Re: I'd like to discuss scaleout at PGCon

2018-06-06 Thread MauMau
From: Simon Riggs
On 5 June 2018 at 17:14, MauMau  wrote:

>> Furthermore, an extra hop and double parsing/planning could matter for
>> analytic queries, too.  For example, SAP HANA boasts of scanning 1
>> billion rows in one second.  In HANA's scaleout architecture, an
>> application can connect to any worker node and the node communicates
>> with other nodes only when necessary (there's one special node called
>> "master", but it manages the catalog and transactions; it's not an
>> extra hop like the coordinator in XL).  Vertica is an MPP analytics
>> database, but it doesn't have a node like the coordinator, either.  To
>> achieve maximum performance for real-time queries, the scaleout
>> architecture should avoid an extra hop when possible.

> Agreed. When possible.
>
> When is it possible?
>
> Two ways I know of are:
>
> 1. have pre-knowledge in the client of where data is located
> (difficult to scale)
> 2. you must put data in all places the client can connect to (i.e.
> multimaster)

Regarding 1, I understood you meant by "difficult to scale" that
whenever the user adds/removes a node from a cluster and data
placement changes, he has to change his application's connection
string to point to the node where necessary data resides.
Oracle Sharding provides a solution for that problem.  See "6.1 Direct
Routing to a Shard" in the following manual page:

https://docs.oracle.com/en/database/oracle/oracle-database/18/shard/sharding-data-routing.html#GUID-64CAD794-FAAA-406B-9E20-0C35E96D3FA8

[Excerpt]
"Oracle clients and connections pools are able to recognize sharding
keys specified in the connection string for high performance data
dependent routing. A shard routing cache in the connection layer is
used to route database requests directly to the shard where the data
resides."
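As a rough model of such data-dependent routing, the client can hash the sharding key from the connection string and cache the resulting shard address, so later requests for the same key skip the extra hop.  The topology, hash choice, and names below are hypothetical, not Oracle's actual client logic:

```python
import zlib

# Hypothetical cluster topology: sharding-key hash bucket -> shard address.
SHARDS = {0: "shard0.example:5432",
          1: "shard1.example:5432",
          2: "shard2.example:5432"}

routing_cache = {}  # sharding key -> shard address, filled on first use

def route(sharding_key):
    """Return the shard address for a sharding key.  The first lookup
    computes (or, in a real client, asks the catalog for) the owning
    shard; later lookups hit the cache and avoid any extra hop."""
    if sharding_key not in routing_cache:
        bucket = zlib.crc32(str(sharding_key).encode()) % len(SHARDS)
        routing_cache[sharding_key] = SHARDS[bucket]
    return routing_cache[sharding_key]

addr = route(42)
print(addr == route(42))  # True: the second call is served from the cache
```

The cache is what makes this scale: topology changes only require invalidating the cache, not rewriting application connection strings.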



> Perhaps there are more?

Please see 6.1.1 below.  The application can connect to multiple nodes
and distribute processing without an extra hop.  This is also an
interesting idea, isn't it?

https://docs.voltdb.com/UsingVoltDB/ChapAppDesign.php#DesignAppConnectMulti


Regards
MauMau




Re: I'd like to discuss scaleout at PGCon

2018-06-06 Thread MauMau
From: Michael Paquier
> Greenplum's orca planner (and Citus?) have such facilities if I recall
> correctly, just mentioning that pushing down directly to remote nodes
> compiled plans ready for execution exists here and there (that's not the
> case of XC/XL).  For queries whose planning time is way shorter than
> their actual execution, like analytical work, that would not matter
> much.  But not for OLTP and short transaction workloads.

It seems that Greenplum does:

https://greenplum.org/docs/580/admin_guide/query/topics/parallel-proc.html#topic1

"The master receives, parses, and optimizes the query. The resulting
query plan is either parallel or targeted. The master dispatches
parallel query plans to all segments,..."

while Citus doesn't:

https://docs.citusdata.com/en/v7.4/develop/reference_processing.html#citus-query-processing

"Next, the planner breaks the query into two parts - the coordinator
query which runs on the coordinator and the worker query fragments
which run on individual shards on the workers. The planner then
assigns these query fragments to the workers such that all their
resources are used efficiently. After this step, the distributed query
plan is passed on to the distributed executor for execution.
...
Once the distributed executor sends the query fragments to the
workers, they are processed like regular PostgreSQL queries. The
PostgreSQL planner on that worker chooses the most optimal plan for
executing that query locally on the corresponding shard table. The
PostgreSQL executor then runs that query and returns the query results
back to the distributed executor."



BTW, the above page states that worker nodes directly exchange data
during query execution.  Greenplum also does so among segment nodes to
join tables that are distributed by different key columns.  XL seems
to do so, too.  If this type of interaction is necessary, how would
the FDW approach handle it?  The remote servers need to interact with
each other.

"The task tracker executor is designed to efficiently handle complex
queries which require repartitioning and shuffling intermediate data
among workers."
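A minimal sketch of that repartitioning step: each row is re-hashed on the join key and assigned to the worker that owns the resulting bucket; rows that land on a worker other than the one currently storing them are what gets "shuffled" over the network.  This is an illustrative model under my own assumptions, not the Citus task tracker:

```python
import zlib

def repartition(rows, join_key_index, num_workers):
    """Assign each row to the worker that owns its join-key hash bucket.
    Rows whose current worker differs must be shipped over the network --
    this is the datanode-to-datanode exchange."""
    buckets = {w: [] for w in range(num_workers)}
    for row in rows:
        key = str(row[join_key_index]).encode()
        buckets[zlib.crc32(key) % num_workers].append(row)
    return buckets

# Two tables distributed on different columns must both be repartitioned
# on the join key before a co-located hash join can run on each worker.
orders = [(1, "alice"), (2, "bob"), (3, "carol")]  # (order_id, customer)
by_customer = repartition(orders, 1, 2)
assert sum(len(b) for b in by_customer.values()) == len(orders)
```

Because the same hash function runs on every node, matching join keys from both tables always end up on the same worker, with no central node in the data path.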



> Greenplum uses also a single-coordinator, multi-datanode instance.  That
> looks similar, right?

Greenplum uses a single master and multiple workers.  That's similar
to Citus.  But Greenplum is not similar to VoltDB nor Vertica, since
those allow applications to connect to any node.


>> Our proprietary RDBMS named Symfoware, which is not based on
>> PostgreSQL, also doesn't have an extra hop, and can handle distributed
>> transactions and deadlock detection/resolution without any special
>> node like GTM.
>
> Interesting to know that.  This is an area with difficult problems.  At
> the closer to merge with Postgres head, the more fun (?) you get into
> trying to support new SQL features, and sometimes you finish with hard
> ERRORs or extra GUC switches to prevent any kind of inconsistent
> operations.

Yes, I hope our deadlock detection/resolution can be ported to
PostgreSQL.  But I'm also concerned like you, because Symfoware is
locking-based, not MVCC-based.

Regards
MauMau




Re: I'd like to discuss scaleout at PGCon

2018-06-06 Thread MauMau
From: Ashutosh Bapat
> Keeping OIDs same across the nodes would require extra communication
> between nodes to keep track of next OID, dropped OIDs etc. We need to
> weigh the time spent in that communication and the time saved during
> parsing.

If we manage the system catalog for cluster-wide objects (tables,
indexes, databases, users, and almost all other objects) on one
central node, and separate the OID space for cluster-wide objects from
that for the node-local objects (temporary tables, tablespaces, and
what else?), we don't need to keep track of OIDs across nodes.
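One hypothetical way to separate the two OID spaces is to reserve a bit of the 32-bit OID for node-local objects; this is only a sketch of the idea under my own assumptions, not a concrete proposal:

```python
# Hypothetical split of the 32-bit OID space: the high bit marks
# node-local objects (temporary tables, tablespaces, ...), so OIDs
# handed out by the central catalog node can never collide with OIDs
# assigned independently on each node, and no cross-node tracking
# of next/dropped OIDs is needed.
LOCAL_BIT = 1 << 31

def make_local_oid(local_counter):
    """OID for a node-local object, assigned without any coordination."""
    return LOCAL_BIT | local_counter

def is_cluster_wide(oid):
    """Cluster-wide OIDs come only from the central catalog node."""
    return not (oid & LOCAL_BIT)

global_oid = 16384             # assigned by the central node
local_oid = make_local_oid(1)  # assigned locally, no communication
print(is_cluster_wide(global_oid), is_cluster_wide(local_oid))  # True False
```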


>> In XL, the data
>> node returns the data type names of the columns in the result set to
>> the coordinator.  Then the coordinator seemed to parse each data type
>> name with base_yyparse() to convert the name to its OID on the
>> coordinator.  That's why base_yyparse() appeared at the top in the
>> perf profile.
>
> I do not understand why we need base_yyparse() to parse a type name.
> We already have functions specifically for parsing object names.

Looking at the XL source code, the following sequence of functions is
called when the coordinator handles the Row Description message ('T')
from the data node.  I guess the parsing is necessary to process type
names combined with type modifiers, e.g. "char(100)".

create_tuple_desc
parseTypeString
typeStringToTypeName
raw_parser
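The caching idea mentioned elsewhere in the thread can be sketched like this: memoize the typename-to-OID conversion so the expensive parse runs once per distinct type string.  The parser stand-in and toy catalog below are hypothetical; only the OIDs for char, integer, and text match PostgreSQL's built-in values:

```python
from functools import lru_cache

PARSE_CALLS = 0  # counts how often the (expensive) parse actually runs

@lru_cache(maxsize=1024)
def typename_to_oid(type_string):
    """Stand-in for the parseTypeString() path: resolve a type string
    such as 'char(100)' once, then serve repeats from the cache."""
    global PARSE_CALLS
    PARSE_CALLS += 1
    base = type_string.split("(")[0].strip()
    toy_catalog = {"char": 1042, "integer": 23, "text": 25}
    return toy_catalog[base]

for _ in range(1000):          # 1000 Row Description fields arrive
    typename_to_oid("char(100)")
print(PARSE_CALLS)  # 1: the parse ran only once
```

This mirrors what logical replication does with its relation/type mapping cache, and would remove base_yyparse() from the hot path for repeated statements.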


Regards
MauMau




Re: I'd like to discuss scaleout at PGCon

2018-06-05 Thread MauMau
From: Simon Riggs
On 1 June 2018 at 16:56, Ashutosh Bapat
 wrote:
>> I think partitioning + FDW provide basic infrastructure for
>> distributing data, planning queries working with such data. We need
>> more glue to support node management, cluster configuration. So, I
>> agree with your statement. But I think it was clear from the beginning
>> that we need more than FDW and partitioning.
>
> No, it wasn't clear. But I'm glad to hear it. It might actually work
> then.

I found a possibly interesting description in the XL manual.  Although
XL performs various pushdowns like FDW, XL seems to perform some kind
of joins with datanode-to-datanode communication.  Doesn't this prove
that the FDW approach can't handle those joins optimally?  What kind
of joins use the shared queue?



https://www.postgres-xl.org/documentation/pg-xc-specifics.html
--
shared_queues (integer)
Datanode Only

For some joins that occur in queries, data from one Datanode may need
to be joined with data from another Datanode. Postgres-XL uses shared
queues for this purpose. During execution each Datanode knows if it
needs to produce or consume tuples, or both.

Note that there may be multiple shared_queues used even for a single
query. So a value should be set taking into account the number of
connections it can accept and the expected number of such joins
occurring simultaneously.
------
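The shared-queue mechanism described above can be modeled as a producer/consumer pair: one datanode streams its join-relevant tuples into a queue and another consumes them, joining against its local data.  This is an in-process Python sketch with made-up data; real XL uses shared memory between separate backend processes:

```python
from queue import Queue

def produce(shared_q, tuples):
    """Datanode A: push its join-relevant tuples into the shared queue."""
    for t in tuples:
        shared_q.put(t)
    shared_q.put(None)  # end-of-stream marker

def consume(shared_q, local_index):
    """Datanode B: join incoming tuples against its local data."""
    joined = []
    while (t := shared_q.get()) is not None:
        if t[0] in local_index:
            joined.append((t[0], t[1], local_index[t[0]]))
    return joined

q = Queue()
produce(q, [(1, "a"), (2, "b")])
result = consume(q, {2: "x", 3: "y"})
print(result)  # [(2, 'b', 'x')]
```

This is exactly the datanode-to-datanode interaction that a pure FDW pushdown model has no obvious place for, since FDW connections run point-to-point from the local server.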


Regards
MauMau




Re: I'd like to discuss scaleout at PGCon

2018-06-05 Thread MauMau
From: Merlin Moncure
> FWIW, Distributed analytical queries is the right market to be in.
> This is the field in which I work, and this is where the action is at.
> I am very, very, sure about this.  My view is that many of the
> existing solutions to this problem (in particular hadoop class
> solutions) have major architectural downsides that make them
> inappropriate in use cases that postgres really shines at; direct
> hookups to low latency applications for example.  postgres is
> fundamentally a more capable 'node' with its multiple man-millennia of
> engineering behind it.  Unlimited vertical scaling (RAC etc) is
> interesting too, but this is not the way the market is moving as
> hardware advancements have reduced or eliminated the need for that in
> many spheres.

I'm feeling the same.  As Moore's Law ceases to hold, software needs
to make the most of the processor power.  Hadoop and Spark are written
in Java and Scala.  According to Google [1] (see Fig. 8), Java is
slower than C++ by 3.7x - 12.6x, and Scala is slower than C++ by 2.5x
- 3.6x.

Won't PostgreSQL be able to cover the workloads of Hadoop and Spark
someday, once it supports scaleout, in-memory database, multi-model
capability, and an in-database filesystem?  That may be a pipe dream,
but why do people have to tolerate the separation of the relational
data warehouse and the Hadoop-based data lake?


[1] Robert Hundt. "Loop Recognition in C++/Java/Go/Scala".
Proceedings of Scala Days 2011.

Regards
MauMau




Re: I'd like to discuss scaleout at PGCon

2018-06-05 Thread MauMau
From: Ashutosh Bapat
> Each node needs to be configured and maintained. That requires effort.
> So we need to keep the number of nodes to a minimum. With a
> coordinator and worker node segregation, we require at least two nodes
> in a cluster, and just that configuration doesn't provide much
> scalability. Having each node function as coordinator (facing
> clients) and worker (facing other coordinators) keeps the number of
> nodes to a minimum. It is good for HA.

I think so, too.  Every node should behave as both the coordinator and
the data node in XL parlance.  But I don't preclude a central node.
Some node needs to manage sequences, and it may as well manage the
system catalog.


Regards
MauMau




Re: I'd like to discuss scaleout at PGCon

2018-06-05 Thread MauMau
From: Ashutosh Bapat
> In order to normalize parse trees, we need to at
> least replace various OIDs in parse-tree with something that the
> foreign server will understand correctly like table name on the
> foreign table pointed to by local foreign table OR (schema qualified)
> function names and so on.

Yes, that's the drawback of each node in the cluster having different
OIDs for the same object.  That applies to XL, too.  In XL, the data
node returns the data type names of the columns in the result set to
the coordinator.  Then the coordinator seemed to parse each data type
name with base_yyparse() to convert the name to its OID on the
coordinator.  That's why base_yyparse() appeared at the top of the
perf profile.  That repeated typename-to-OID conversion could be
reduced by caching the conversion result, as the logical replication
of PostgreSQL does.  But managing the catalog in one place and using
the same OID values seems more concise to me as a concept.


Regards
MauMau


-Original Message- 
From: Ashutosh Bapat
Sent: Saturday, June 2, 2018 1:00 AM
To: Tom Lane
Cc: MauMau ; Robert Haas ; PostgreSQL Hackers
Subject: Re: I'd like to discuss scaleout at PGCon

On Fri, Jun 1, 2018 at 11:27 AM, Tom Lane  wrote:
> Ashutosh Bapat  writes:
>> In order to avoid double parsing, we might want to find a way to pass
>> a "normalized" parse tree down to the foreign server. We need to
>> normalize the OIDs in the parse tree since those may be different
>> across the nodes.
>
> I don't think this is a good idea at all.  It breaks any hope of
> supporting remote servers that are not the identical version to the
> local one (since their parsetrees might be different).  And "normalized
> OIDs" sounds like "pie in the sky".  You might get away with assuming
> that built-in functions have stable OIDs, but you can't expect that for
> functions in extensions.

Sorry for confusing writeup. I didn't mean "normalized OIDs" as I
mentioned in my last sentence. I meant "normalized parse-tree" as in
the first sentence. In order to normalize parse trees, we need to at
least replace various OIDs in parse-tree with something that the
foreign server will understand correctly like table name on the
foreign table pointed to by local foreign table OR (schema qualified)
function names  and so on. There might be more things to "normalize"
in the parse tree other than OIDs, but I can't think of anything right
now.


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company





Re: I'd like to discuss scaleout at PGCon

2018-06-05 Thread MauMau
From: Robert Haas
On Thu, May 31, 2018 at 8:12 AM, MauMau  wrote:
>> Oh, I didn't know you support FDW approach mainly for analytics.  I
>> guessed the first target was OLTP read-write scalability.
>
> That seems like a harder target to me, because you will have an extra
> hop involved -- SQL from the client to the first server, then via SQL
> to a second server.  The work of parsing and planning also has to be
> done twice, once for the foreign table and again for the table.  For
> longer-running queries this overhead doesn't matter as much, but for
> short-running queries it is significant.


From: Simon Riggs
On 1 June 2018 at 04:00, MauMau  wrote:
>> The SQL processor should be one layer, not two layers.

> For OLTP, that would be best. But it would be restricted to
> single-node requests, leaving you the problem of how you know ahead of
> time whether an SQL statement was single node or not.
>
> Using a central coordinator node allows us to hide the decision of
> single-node/multi-node from the user which seems essential for general
> SQL. If you are able to restrict the types of requests users make then
> we can do direct access to partitions - so there is scope for a
> single-node API, as Mongo provides.

I don't think an intermediate server like the coordinators in XL is
necessary.  That extra hop can be eliminated by putting both the
coordinator and the data node roles in the same server process.  That
is, the node to which an application connects communicates with other
nodes only when it does not hold the necessary data.

Furthermore, an extra hop and double parsing/planning could matter for
analytic queries, too.  For example, SAP HANA boasts of scanning 1
billion rows in one second.  In HANA's scaleout architecture, an
application can connect to any worker node and the node communicates
with other nodes only when necessary (there's one special node called
"master", but it manages the catalog and transactions; it's not an
extra hop like the coordinator in XL).  Vertica is an MPP analytics
database, but it doesn't have a node like the coordinator, either.  To
achieve maximum performance for real-time queries, the scaleout
architecture should avoid an extra hop when possible.


> Using a central coordinator also allows multi-node transaction
> control, global deadlock detection etc..

VoltDB does not have an always-pass hop like the coordinator in XL.
Our proprietary RDBMS named Symfoware, which is not based on
PostgreSQL, also doesn't have an extra hop, and can handle distributed
transactions and deadlock detection/resolution without any special
node like GTM.


Regards
MauMau




Re: I'd like to discuss scaleout at PGCon

2018-06-05 Thread MauMau
From: Simon Riggs
> Passing detailed info between servers is exactly what XL does.
>
> It requires us to define a cluster, exactly as XL does.
>
> And yes, its a good idea to replicate some tables to all nodes, as XL
> does.
>
> So it seems we have at last some agreement that some of the things XL
> does are the correct approaches.

Exactly.  I was impressed when I read the paper on XC for the first
time.  I respect what Suzuki-san and the people from NTT,
EnterpriseDB, and 2ndQuadrant have done for XC/XL.  Just as some
people believe we should leverage the FDW infrastructure, I think we
should leverage the code and ideas of XC/XL.

Regards
MauMau




Re: New committers announced at PGCon 2018

2018-06-03 Thread MauMau
From: Tom Lane
The core team is pleased to announce the appointment of seven
new Postgres committers:


Congratulations to the seven of you!  Let's help the new committers by
doing good reviews and tests, and let's aim to become committers
ourselves.

FYI, I had a quick look at how many years the new committers have
contributed to the community.  I used the following archive search
with the search term "From: ".  I'm afraid this is an
approximation, because the maximum hit count is 1,000.  There must
have been long histories...

https://www.postgresql.org/search/?m=1&ln=pgsql-hackers


Etsuro Fujita
7 years, 2011-09-12
https://www.postgresql.org/message-id/4e6de18b.4010...@lab.ntt.co.jp

Peter Geoghegan
8 years, 2010-02-11
https://www.postgresql.org/message-id/db471ace1002110833v708f7f60xd07e9534cdb3c...@mail.gmail.com

Amit Kapila
7 years, 2011-09-07
https://www.postgresql.org/message-id/DEA262E072764EBD82E0916334414EE3@china.huawei.com

Alexander Korotkov
8 years, 2010-05-12
https://www.postgresql.org/message-id/1273690962-sup-2...@alvh.no-ip.org

Thomas Munro
7 years, 2011-11-06
https://www.postgresql.org/message-id/CADLWmXXSBRt-4kKFFbR4jUdVskZEF4Ob5l_asq+e7qx1p1d...@mail.gmail.com

Michael Paquier
9 years, 2009-08-07
https://www.postgresql.org/message-id/c64c5f8b0908062031k3ff48428j824a9a46f2818...@mail.gmail.com

Tomas Vondra
11 years, 2007-01-25
https://www.postgresql.org/message-id/20070125155424.gg64...@nasby.net


Regards
MauMau





Re: I'd like to discuss scaleout at PGCon

2018-05-31 Thread MauMau
2018-05-31 22:44 GMT+09:00, Robert Haas :
> On Thu, May 31, 2018 at 8:12 AM, MauMau  wrote:
>> Oh, I didn't know you support FDW approach mainly for analytics.  I
>> guessed the first target was OLTP read-write scalability.
>
> That seems like a harder target to me, because you will have an extra
> hop involved -- SQL from the client to the first server, then via SQL
> to a second server.  The work of parsing and planning also has to be
> done twice, once for the foreign table and again for the table.  For
> longer-running queries this overhead doesn't matter as much, but for
> short-running queries it is significant.

Yes, that extra hop and double parsing/planning were the killer for
our performance goal when we tried to meet our customer's scaleout
needs with XL.  The application executes 82 DML statements in one
transaction.  Those DMLs consist of INSERTs, UPDATEs, and SELECTs that
each access only one row via a primary key.  Only a few tables are
involved, so the application PREPAREs a few statements and EXECUTEs
them repeatedly.  We placed the coordinator node of XL on the same
host as the application, and the data nodes and GTM on other
individual hosts.

The response time with XL was 2.4 times that of plain PostgreSQL, and
the throughput (tps) was 43%.  Interestingly, perf showed that
base_yyparse() was the top CPU consumer on both the coordinator and
the data node, whereas on plain PostgreSQL base_yyparse() appeared
near the bottom of the ranking.  The SQL processor should be one
layer, not two.

In the above benchmark, each transaction only accessed data on one
data node.  That's what sharding principles recommend.  The FDW
approach would be no problem as long as the application follows the
sharding recommendation.

But not all applications will/can follow the sharding recommendation.
The above application, which is migrated from a mainframe, uses
INSERTs to load data, inserting rows onto various nodes.  Given
your concern about double parsing/planning for a local foreign table
and a remote real table, wouldn't the FDW approach hit a wall?


> I don't know what "node management" and "failure detection/failover"
> mean specifically.  I'd like to hear proposals, though.

That's nothing special or new.  Things like:

* Define a set of nodes that can join the cluster.
* Initialize or configure a node according to its role in the cluster.
* Decommission a node from the cluster.
* Define a node group in which all member nodes have the same data set
for redundancy.
* One command to start and shut down the entire cluster.
* System tables to display the member nodes and node groups.
* Each node's in-memory view of the current cluster state.
* Which other nodes each node monitors.
* Elect a new primary node within a node group when the current
primary node fails.
* Whether each node group should be configured with a master-slave
replication topology, or a multi-master topology like MySQL Group
Replication

Some of the above may end up with XL's things like
pgxc_node/pgxc_group system tables, pgxc_ctl command, CREATE/DROP
NODE/NODE GROUP commands, etc.
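As an illustration only (the class and field names below are made up for
this sketch, not a proposal for actual catalog or command names), the
membership state such system tables would track, including primary
election within a node group, can be modeled like this:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    host: str
    group: str            # redundancy group; all members hold the same data
    is_primary: bool = False
    alive: bool = True

class Cluster:
    """In-memory view of cluster membership (cf. XL's pgxc_node/pgxc_group)."""
    def __init__(self):
        self.nodes = {}

    def add(self, node):                 # define a node that can join the cluster
        self.nodes[node.name] = node

    def decommission(self, name):        # remove a node from the cluster
        del self.nodes[name]

    def group(self, g):                  # members of one node group
        return [n for n in self.nodes.values() if n.group == g]

    def elect_primary(self, g):          # failover within a node group
        for n in self.group(g):
            if n.alive:
                n.is_primary = True
                return n.name
        return None

c = Cluster()
c.add(Node("n1", "10.0.0.1", "g1", is_primary=True))
c.add(Node("n2", "10.0.0.2", "g1"))
c.nodes["n1"].alive = False        # the current primary fails...
c.nodes["n1"].is_primary = False
print(c.elect_primary("g1"))       # -> n2
```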


Regards
MauMau



Re: I'd like to discuss scaleout at PGCon

2018-05-31 Thread MauMau
ned
> like a foreign table by connecting to a node on which it is present.
> The set of nodes on which a table is present is metadata that is
> shared throughout the cluster.  Multi-master logical replication
> propagates changes between all nodes on which the table is present.
> With a concept like this, there is a lot of opportunity to optimize
> queries by, for example, deciding to perform a join on a node on which
> both input tables are present, to minimize data movement.

I agree.  XL, Oracle Sharding, and possibly MySQL Cluster do that,
too.  It seems like a must-do thing.
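The placement-aware join-site choice quoted above boils down to a set
intersection over table-placement metadata.  A toy sketch (the table and
node names are made up for illustration):

```python
# Placement metadata: which nodes hold a (replicated) copy of each table.
placement = {
    "orders":    {"node1", "node2"},
    "customers": {"node2", "node3"},
}

def join_site(t1, t2):
    """Prefer a node holding both inputs, so the join moves no data."""
    common = placement[t1] & placement[t2]
    return sorted(common)[0] if common else None  # None -> rows must be shipped

print(join_site("orders", "customers"))  # -> node2
```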


> But even if
> we create something like this, I see it as fundamentally an extension
> of the FDW approach that would hopefully manage to reuse a lot of
> what's already been built there.  I don't think we need to (or should)
> throw away the work that's been done on FDW pushdown and start over --
> we should find a way to build on top of it and add ease-of-use and
> management features.

Agreed.  I think we should not write much code from scratch, either.
On the other hand, if we have to support sharding natively without
FDW, I wonder whether we can reuse the FDW code -- that is, extract
the necessary logic from the FDW implementation into common functions
so that the native sharding code can call them as well.


> In fact, even if we said that we want a fully distributed database,
> we'd probably still need some kind of distributed table concept.
> Unless every node has a full copy of everything in the database, you
> still need to reason about which data is present on which nodes and
> optimize queries accordingly.

Then, how about building the cluster membership management first,
including node management and failure detection/failover?  I think
node management is necessary anyway, and other developers could then
experiment with other things on that cluster infrastructure.  Do you
think it would be helpful or wasteful?  I'm trying to find what we
can do for an early scaleout release.


> From the chart view, in February 2016, SQL Server was at 1150.227, and
> MongoDB was at 305.599.  Generally it looks like the "big three" --

Thank you for looking at the chart and telling me the figures.

> I
> think it's pretty clear that we need to both continue to improve some
> of these major new features we've added and at the same time keep
> introducing even more new things if we want to continue to gain market
> share and mind share.  I hope that features like scale-out and also
> zheap are going to help us continue to whittle away at the gap, and I
> look forward to seeing what else anyone may have in mind.

Definitely.  I couldn't agree more.


Regards
MauMau



I'd like to discuss scaleout at PGCon

2018-05-26 Thread MauMau
Hello,

I'm going to attend PGCon in Ottawa for the first time.  I am happy if
I can meet you.

Because I'm visually impaired, I can only sense light.  If you see a
Japanese man, 171 cm tall, with a white cane, it's probably me.  I'd
be happy if you talked to me.  But as I'm still far from good at
listening to and speaking English, I'm sorry if I seem unfriendly or
cannot keep on talking for a long time.


I'd like to have a session on scaleout design at the unconference.
I've created a wiki page for that (this is still just a memo; I'd like
to populate this page with you as the discussion in the community
progresses).  I'd appreciate it if someone could stand with me and
facilitate the discussion at the unconference.

https://wiki.postgresql.org/wiki/Scaleout_Design

The background is ... our company is faced with an immediate need to
develop the read-write scaleout feature on PostgreSQL.  We tried
Postgres-XL with much hope, but we found it difficult to achieve our
performance goal.  I will tell you the details at the conference.  But
personally, Postgres-XL seems to be very nice software, and I feel
that good parts of it should be integrated into core.

I know that many great hackers from 2ndQuadrant, EnterpriseDB, NTT,
Postgres Professional, CitusData, and so on are addressing this
difficult scaleout feature.  I don't think we are yet competent to
lead this development.

On the other hand, we have a proprietary RDBMS called Symfoware (I'm
sure you don't know it), which is not based on PostgreSQL, that
provides the scaleout feature.  Its architecture is a mix of shared
nothing and shared everything.  It implements deadlock detection and
resolution without a central node or periodic monitoring, parallel 2PC
across nodes, parallel crash recovery, client connection routing and
failover without any overhead of intermediary middleware during SQL
execution, etc.  So we may be able to help in some way.  I'd be happy
if we could help the community to proceed with development of
scaleout.
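I don't know Symfoware's internals, but deadlock detection without a
central node is commonly done with probe-based (edge-chasing) algorithms
in the style of Chandy-Misra-Haas: each node knows only its local
wait-for edges and forwards a probe along them; a deadlock is detected
when the probe returns to the transaction that initiated it.  A toy
single-successor sketch:

```python
# Local wait-for edges: T1 waits for T2, T2 for T3, T3 for T1 (a cycle).
wait_for = {"T1": "T2", "T2": "T3", "T3": "T1"}

def detect(start):
    """Follow the probe along wait-for edges; True means deadlock."""
    seen, t = set(), start
    while t in wait_for:
        t = wait_for[t]
        if t == start:
            return True          # probe came back to its initiator: cycle
        if t in seen:
            return False         # cycle exists but doesn't involve `start`
        seen.add(t)
    return False                 # chain ended at a non-waiting transaction

print(detect("T1"))  # -> True
```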

If you have a session for scaleout outside the unconference, could you
call me and let me join it?


By the way, the popularity score of PostgreSQL finally exceeded 400
points in the DB-Engines ranking!  The popularity difference with the
top products has shrunk greatly.  Let's make PostgreSQL more popular.

https://db-engines.com/en/ranking

[as of May 27, 2018]
Oracle=1290.42  MySQL=1223.34  SQL Server=1085.84
PostgreSQL=400.90  MongoDB=342.11
(Oracle / PostgreSQL ratio is 3.2)

[as of Feb 2016, according to a memo at hand]
Oracle=1476.14  MySQL=1321.13  SQL Server=??
MongoDB=??  PostgreSQL=288.66
(Oracle / PostgreSQL ratio is 5.1)


Regards
MauMau




Re: Truncation failure in autovacuum results in data corruption (duplicate keys)

2018-04-18 Thread MauMau
From: Tom Lane
[ re-reads thread... ]  The extra assumption you need in order to have
trouble is that the blocks in question are dirty in shared buffers and
have never been written to disk since their rows were deleted.  Then
the situation is that the page image on disk shows the rows as live,
while the up-to-date page image in memory correctly shows them as
dead.
Relation truncation throws away the page image in memory without ever
writing it to disk.  Then, if the subsequent file truncate step fails,
we have a problem, because anyone who goes looking for that page will
fetch it afresh from disk and see the tuples as live.


Thank you so much, I got it!  I'm always impressed at how quick and
concise you are while you are busy addressing multiple issues and
answering user questions.  I wouldn't be surprised to hear that there
are multiple clones of Tom Lane.

I'd like to continue to think of a solution and create a patch, based
on the severity and how the customer will respond to our answer.  I
have a feeling that we have to say it's a bit serious, since it
requires recovery from a base backup, not just rebuilding indexes.
The patch development may be after PGCon.

Regards
MauMau






Truncation failure in autovacuum results in data corruption (duplicate keys)

2018-04-18 Thread MauMau
Hello,

It seems to me that our customer might have hit an unresolved data
corruption issue which is already known in this ML, but I can't figure
out why this happened.  I'd appreciate if you could give me your
thoughts.  Depending on the severity of this issue and the customer's
request, I think I'll submit a patch to solve the unresolved issue.

The customer is using PostgreSQL 9.2 on Windows.  Autovacuum failed to
truncate the pages at the end of a table, which is probably due to
anti-virus software.  FYI, automatic checkpoint was in progress when
this error occurred.

ERROR:  could not truncate file "base/aaa/bbb" to 58 blocks:
Permission denied

After a while, an application got a unique key violation error.  The
customer says that there shouldn't be any duplicate keys.

ERROR:  duplicate key value violates unique constraint "pk_xxx"

The output of pg_dump on that table certainly includes multiple
pairs of rows with the same primary key values... data corruption.



Another Japanese user, who is not our customer, hit the same problem
with 9.4, which was not solved (note: the mail is in Japanese).  He
said he repeatedly encountered the same error even after REINDEXing
with 9.4, but it doesn't happen with 9.1.  I wonder if there's
something introduced in 9.2 which causes the issue, such as index-only
scan stuff:

https://ml.postgresql.jp/pipermail/pgsql-jp/2016-October/016865.html


The problem with truncation failure was found in 2010.  The user
reported the problem as another phenomenon on 9.0.  The problem could
not be solved even by leading hackers here -- Tom, Robert, Alvaro,
Heikki, Greg Stark, etc.

TODO
https://wiki.postgresql.org/wiki/Todo

Restructure truncation logic to be more resistant to failure
This also involves not writing dirty buffers for a truncated or
dropped relation
http://archives.postgresql.org/pgsql-hackers/2010-08/msg01032.php



Tom's comments in above thread

Imagine that we have some rows at the end of a table, we delete them,
we vacuum before the next checkpoint.  Vacuum decides it can now
truncate away the last pages, but fails to do so.  The original page
state is still on disk, which means we have lost the fact of the
deletion --- the rows are now effectively live again, though their
index entries are probably gone.

...
Still, we have a live issue with heap truncation during plain VACUUM.
However, the scope of the problem seems a lot less than I was
thinking.
Maybe write-the-buffers-first is a sufficient longterm solution.

...
So it seems like the only case where there is really grounds for PANIC
on failure is the VACUUM case.  And there we might apply Heikki's idea
of trying to zero the untruncatable pages first.

...
I'm thinking that we need some sort of what-to-do-on-error flag
passed into RelationTruncate, plus at least order-of-operations fixes
in several other places, if not a wholesale refactoring of this whole
call stack.  But I'm running out of steam and don't have a concrete
proposal to make right now.  In any case, we've got more problems
here than just the original one of forgetting dirty buffers too soon.



However, I have a question.  How does the truncation failure in
autovacuum lead to duplicate keys?  The failed-to-be-truncated pages
should only contain dead tuples, so pg_dump's table scan should ignore
dead tuples in those pages.
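One way the quoted mechanism could produce duplicates, sketched as a toy
model (this is my reading of Tom's explanation, not something verified
against the 9.2 code): the dead tuples come back to life in the stale
on-disk page image while their index entries are already gone, so a later
insert of the same key passes the unique check, leaving two visible heap
tuples with one key.

```python
# Toy model of Tom's scenario: heap pages plus a unique index.
heap_disk = [("k1", "live")]   # on-disk page image still shows the row live
heap_mem  = [("k1", "dead")]   # up-to-date buffer correctly shows it deleted
index = set()                  # index entry for k1 already vacuumed away

# VACUUM truncation discards the buffer WITHOUT writing it out...
heap_mem = None
# ...then the file truncate fails, so the stale on-disk image survives,
# and the next reader fetches the page afresh from disk.
heap = list(heap_disk)

# A new insert of the same key passes the unique check (no index entry),
# so the heap now holds two visible tuples with the same key.
assert "k1" not in index
index.add("k1")
heap.append(("k1", "live"))
print([t for t in heap if t[1] == "live"])  # -> [('k1', 'live'), ('k1', 'live')]
```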

Regards
MauMau






Re: [bug fix] ECPG: freeing memory for pgtypes crashes on Windows

2018-03-25 Thread MauMau
From: Kyotaro HORIGUCHI 
The objective of this patch looks reasonable and this doesn't
affect ecpg applications except for the problematic case that
happens only on Windows. So the points here are only the
documentation, the new function name, and how we should place the
new definition.

I think this doesn't need more profound review so I'll mark this
Ready For Commit after confirming the amendment.


I'm sorry for my late reply.  Last week I was off for a week.

And thank you for your review.  All modifications are done.


Regards
MauMau


pgtypes_freemem_v3.patch
Description: Binary data


Re: Changing default value of wal_sync_method to open_datasync on Linux

2018-03-24 Thread MauMau
From: Robert Haas
I also said it would be worse on spinning disks.

Also, Yoshimi Ichiyanagi did not find it to be true even on NVRAM.


Yes, let me withdraw this proposal.  I couldn't see any performance
difference even with ext4 volume on a PCIe flash memory.

Regards
MauMau




Re: [RFC] What would be difficult to make data models pluggable for making PostgreSQL a multi-model database?

2017-12-03 Thread MauMau
From: Henry
Would this require the new pluggable storage which is currently in
development or would the existing storage engine be sufficient? I am
just wondering if there are any rough design/plans for this...

I'm sorry for the long interval.  The graph model can be implemented
on top of the relational storage engine, the way Oracle and SQL
Server store graph data in relational tables.  But maximizing
graph-traversal performance requires a storage engine optimized for
the graph model, one that uses direct pointers between graph nodes.
So I think a new level of pluggable storage is necessary.
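To illustrate the direct-pointer point (a generic sketch, not tied to any
particular engine): with a relational edge table, every hop costs an
index lookup, whereas a native graph node can follow an in-memory
reference directly.

```python
# Direct-pointer representation: each node holds references to its
# neighbors, so traversal is pointer chasing with no per-hop lookup
# in an edge table.
class GraphNode:
    def __init__(self, name):
        self.name = name
        self.neighbors = []   # direct references, not foreign keys

a, b, c = GraphNode("a"), GraphNode("b"), GraphNode("c")
a.neighbors.append(b)
b.neighbors.append(c)

# Two-hop traversal without touching any edge table or index:
print([n.name for n in a.neighbors[0].neighbors])  # -> ['c']
```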

Regards
MauMau