RE: Copy data to DSA area

2019-07-12 Thread Ideriha, Takeshi
Hi, thank you for the previous two emails. 

Thomas Munro  wrote:
>What do you think about the following?  Even though I know you want to start
>with much simpler kinds of cache, I'm looking ahead to the lofty end-goal of
>having a shared plan cache.  No doubt, that involves solving many other
>problems that don't belong in this thread, but please indulge me:

My initial motivation came from shared catcache and relcache, but I also think
a shared plan cache is one of the big topics, and I'd be very excited if it
came true. Planning at each backend can sometimes become an enormous overhead.

>On Wed, Jul 10, 2019 at 6:03 PM Thomas Munro  wrote:
>> Hmm.  I wonder if we should just make ShmContextFree() do nothing! And
>> make ShmContextAlloc() allocate (say) 8KB chunks (or larger if needed
>> for larger allocation) and then hand out small pieces from the
>> 'current' chunk as needed.  Then the only way to free memory is to
>> destroy contexts, but for the use case being discussed, that might
>> actually be OK.  I suppose you'd want to call this implementation
>> something different, like ShmRegionContext, ShmZoneContext or
>> ShmArenaContext[1].
>
>I guess what I said above is only really appropriate for complex things like
>plans that have their own contexts so that we can delete them easily "in
>bulk".  I guess it's not true for caches of simpler objects like catcache,
>that don't want a context for each cached thing and want to free objects
>"retail" (one by one).  So I guess you might want something more like your
>current patch for (say) SharedCatCache, and something like the above-quoted
>idea for (say) SharedPlanCache or SharedRelCache.

Here's what I get:
- The temporary/permanent concept can be implemented by following the current
  cached-plan scheme of reparenting, plus adding a general API,
  MemoryContextClone.

- There are at least two allocation strategies (implementation ways). One is
  ShmZoneContext: region-based allocation, which drops chunks in bulk. This
  type is useful for caches made of complex object graphs, like the plan
  cache. In this case the clean-up list (the list of pointers to
  dsa_allocated objects) needs to be located in shared memory, because the
  list is used not only in the error case but also in DropCachedPlan.

- The other is not region-based but allocates and frees objects "retail".
  I would call this the ShmSimpleContext case. It would be suitable for,
  say, the catalog cache.

For ShmSimpleContext, it is OK to keep the clean-up list in local memory,
right? Unlike ShmZoneContext, this list is only used on error, and a cache
entry can be freed by just traversing its simple object graph, as discussed
before.

Regarding the tracking list for cleanup, I'd change the design based on
either ShmSimpleContext or ShmZoneContext, assuming the clean-up list is
pointed to by the context object in both types.

With ShmSimpleContext, the list is deleted after the context is reparented
to the long-lived one. The context for building, say, a catcache entry is
located in shared memory, but its tracking list is in local memory. It may
seem weird that a shared object points to a local object, but there seems
to be no problem, because other processes cannot touch the context until
it is reparented.

With ShmZoneContext, it would be OK to keep the list even after reparenting,
because both this intermediate context and the list are supposed to be
located in shared memory and used there.

Regarding the intermediate-level context, I was wondering whether we need it
in the ShmSimpleContext case. Every time we build a catcache entry, we make
a MemoryContext object in shared memory, and this adds some memory overhead.
But right now a context object per catcache entry seems a simple
architecture, and moreover it is easier to handle in the error case.
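To make the ShmZoneContext side concrete, here is a minimal sketch of the
region-based allocation I have in mind, following Thomas's 8kB-chunk
suggestion above. All names and sizes here are mine, not from any patch;
oversized requests and locking are ignored:

#define ZONE_CHUNK_SIZE 8192		/* following the 8kB suggestion above */

typedef struct ZoneChunk
{
	dsa_pointer next;			/* previously allocated chunk */
	Size		used;			/* bytes handed out from data[] */
	char		data[FLEXIBLE_ARRAY_MEMBER];
} ZoneChunk;

typedef struct ShmZoneContext
{
	dsa_area   *area;
	dsa_pointer head;			/* newest chunk; InvalidDsaPointer if none */
} ShmZoneContext;

static void *
ShmZoneAlloc(ShmZoneContext *cxt, Size size)
{
	ZoneChunk  *chunk = NULL;

	size = MAXALIGN(size);		/* a request larger than a chunk would need
								 * its own dedicated chunk; omitted here */
	if (DsaPointerIsValid(cxt->head))
		chunk = (ZoneChunk *) dsa_get_address(cxt->area, cxt->head);

	if (chunk == NULL ||
		chunk->used + size > ZONE_CHUNK_SIZE - offsetof(ZoneChunk, data))
	{
		/* current chunk is full (or absent): start a new one */
		dsa_pointer p = dsa_allocate(cxt->area, ZONE_CHUNK_SIZE);

		chunk = (ZoneChunk *) dsa_get_address(cxt->area, p);
		chunk->next = cxt->head;
		chunk->used = 0;
		cxt->head = p;
	}
	chunk->used += size;
	return chunk->data + chunk->used - size;
}

/* Freeing single objects is a no-op; destroying the context returns
 * every chunk to DSA in bulk. */
static void
ShmZoneDelete(ShmZoneContext *cxt)
{
	while (DsaPointerIsValid(cxt->head))
	{
		ZoneChunk  *chunk = (ZoneChunk *) dsa_get_address(cxt->area, cxt->head);
		dsa_pointer next = chunk->next;

		dsa_free(cxt->area, cxt->head);
		cxt->head = next;
	}
}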


>For an implementation that supports retail free, perhaps you could store the
>address of the clean-up list element in some extra bytes before the returned
>pointer, so you don't have to find it by linear search.  Next, I suppose you
>don't want to leave holes in the middle of the array, so perhaps instead of
>writing NULL there, you could transfer the last item in the array to this
>location (with associated concurrency problems).

Sure, I'll fix it.
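For the record, this is how I understand the suggestion, as a rough sketch
(struct and function names are mine; overflow checks and the concurrency
problems you mention are ignored):

#define NUM_TRACKED 1024		/* arbitrary */

typedef struct RetailChunkHeader
{
	int			slot;			/* index into the clean-up array */
} RetailChunkHeader;

typedef struct CleanupArray
{
	int			nused;
	dsa_pointer chunks[NUM_TRACKED];
} CleanupArray;

static void *
retail_alloc(dsa_area *area, CleanupArray *cleanup, Size size)
{
	dsa_pointer dp = dsa_allocate(area, sizeof(RetailChunkHeader) + size);
	RetailChunkHeader *hdr = (RetailChunkHeader *) dsa_get_address(area, dp);

	hdr->slot = cleanup->nused;			/* remember our slot, so free is O(1) */
	cleanup->chunks[cleanup->nused++] = dp;
	return hdr + 1;						/* user data starts after the header */
}

static void
retail_free(dsa_area *area, CleanupArray *cleanup, void *ptr)
{
	RetailChunkHeader *hdr = ((RetailChunkHeader *) ptr) - 1;
	int			slot = hdr->slot;
	dsa_pointer victim = cleanup->chunks[slot];
	dsa_pointer last = cleanup->chunks[--cleanup->nused];

	/* fill the hole with the last element instead of leaving a NULL */
	cleanup->chunks[slot] = last;
	((RetailChunkHeader *) dsa_get_address(area, last))->slot = slot;

	dsa_free(area, victim);
}

The swap keeps the array dense, at the cost of the concurrency issues noted
in the quoted mail.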

>Since I don't think anyone ever said it explicitly, the above discussion is
>all about how we get to this situation, while making sure that we're mostly
>solving problems that occur in both multi-process and multi-threaded designs:
>
>shared_metadata_cache = '100MB'
>shared_plan_cache = '100MB'

Yes, we are talking about this. Thank you for the clarification.

Regards,
Takeshi Ideriha



RE: Global shared meta cache

2019-06-26 Thread Ideriha, Takeshi
Hi, everyone.

>From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
>My current thoughts:
>- Each catcache has a (maybe partial) HeapTupleHeader
>- Put every catcache on shared memory, with no local catcache
>- But a catcache for an aborted tuple is not put on shared memory
>- A hash table exists per kind of CatCache
>- These hash tables exist per database, plus one for shared catalogs
>  - e.g.) there is a hash table for pg_class of a DB

I talked about the shared CatCache (SysCache) with Thomas at PGCon, and he
suggested using sinval to control cache visibility instead of xid.
Based on this I've changed my design. I'll send a PoC patch within a week,
but let me share the idea beforehand. I'm sorry this email is long to read,
but I'd be happy to get some comments.

Basically I won't make the shared catcache the default; it will be an option.

Both local and shared memory have hash tables for the catcache. A shared
hash entry is the catctup itself, and a local hash entry is a pointer to the
shared catctup. Actually, the local hash entry does not hold a direct
pointer but points to a handle of the shared catctup. The handle points to
the shared catctup and is located in shared memory. This is intended to
avoid dangling pointers in local hash entries when a shared catctup is
evicted by LRU. (The details of LRU will be written in another email,
because I'll implement it later.)
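A rough sketch of the handle indirection I have in mind (field names are
tentative, not from any patch):

/* Lives in shared memory; outlives any particular catctup. */
typedef struct CatCTupHandle
{
	dsa_pointer ct;				/* shared catctup, or InvalidDsaPointer once
								 * the entry has been evicted by LRU */
} CatCTupHandle;

/* Local hash entry: points at the handle, never at the catctup itself. */
typedef struct LocalCatCacheEnt
{
	uint32		hash_value;		/* hash of the cache lookup key */
	dsa_pointer handle;			/* CatCTupHandle in shared memory */
} LocalCatCacheEnt;

Because the local entry only ever dereferences the handle, evicting a
catctup just needs to clear the handle rather than chase every backend's
pointers.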

* Search and Insert
Current postgres searches the (local) hash table, and on a miss it searches
the actual catalog (shared buffers and disk) and builds the cache, building
a negative cache entry if the tuple is not found.

In the new architecture, if a cache entry is not found in the local hash
table, postgres tries to search the shared one before consulting shared
buffers. Here is the detail. To begin with, postgres looks up the pointer
in the local hash table. If it's found, it dereferences the pointer and
gets the catctup. If not, it searches the shared hash table, and if the
catctup is found there, it inserts a pointer to it into the local hash
table. If it doesn't exist in the shared hash table either, postgres
searches the actual catalog, builds the cache, and in most cases inserts it
into the shared hash table and its pointer into the local one. The
exception is a cache entry built from an uncommitted catalog tuple, which
must not be seen by other processes. Such an uncommitted cache entry is
built in local memory and pushed directly into the local table, but not the
shared one. Lastly, if there is no tuple we're looking for, a negative
entry is put into the shared hash table.
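The intended lookup order, as an outline. Every helper below is a
placeholder, not an existing function; this is pseudocode in C clothing:

typedef struct CatCache CatCache;
typedef struct CatCTup CatCTup;

static CatCTup *local_hash_lookup(CatCache *cache, Datum key);
static CatCTup *shared_hash_lookup(CatCache *cache, Datum key);
static void local_hash_insert(CatCache *cache, Datum key, CatCTup *ct);
static void shared_hash_insert(CatCache *cache, Datum key, CatCTup *ct);
static CatCTup *build_from_catalog(CatCache *cache, Datum key);
static bool built_from_committed_tuple(CatCTup *ct);

static CatCTup *
shared_catcache_lookup(CatCache *cache, Datum key)
{
	CatCTup    *ct = local_hash_lookup(cache, key); /* 1. local table */

	if (ct != NULL)
		return ct;

	ct = shared_hash_lookup(cache, key);			/* 2. shared table */
	if (ct != NULL)
	{
		local_hash_insert(cache, key, ct);	/* remember the pointer */
		return ct;
	}

	/* 3. scan the catalog; a negative entry is built on no match */
	ct = build_from_catalog(cache, key);
	if (built_from_committed_tuple(ct))
		shared_hash_insert(cache, key, ct);	/* share it */
	/* an uncommitted cache entry stays local-only */
	local_hash_insert(cache, key, ct);
	return ct;
}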

* Invalidation and visibility control
Now let's talk about invalidation. Current cache invalidation is based on
the local and shared invalidation queues (sinval). When a transaction
commits, sinval messages are queued into the shared queue. Other processes
read and process sinval messages at their own timing.

For the shared catcache, I follow the current sinval design in most parts,
but I'll change what happens when a sinval message is queued and when it is
read by a process. When messages are added to the shared queue, the
corresponding shared caches are identified (matched by hash value) and
their "obsolete flag" is turned on. When a sinval message is read by a
process, the process deletes the matching local hash entries (pointers to
handles). Each process can see a shared catctup as long as its pointer
(local entry) is valid. Because the sinval messages have not been processed
yet, it's OK to keep seeing the pointer to a possibly old cache. After a
local entry is invalidated, its process searches the shared hash table so
as to always find a catctup whose obsolete flag is off. The process sees
the right shared cache after invalidation messages are read, because it
checks the obsolete flag, and an uncommitted cache never exists in shared
memory at all.
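In code terms, the reader-side rule would be something like this, assuming
the shared catctup grows a boolean obsolete field (shared_hash_lookup is a
placeholder as in the earlier sketch):

/* Fresh lookups refuse obsolete entries, forcing a rebuild from the
 * catalog, while holders of still-valid local pointers may keep using
 * the old version until they process the sinval messages. */
static CatCTup *
shared_hash_lookup_valid(CatCache *cache, Datum key)
{
	CatCTup    *ct = shared_hash_lookup(cache, key);

	if (ct != NULL && ct->obsolete)
		return NULL;			/* treat as a miss */
	return ct;
}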

There is a subtle point here. Always finding a shared catctup without the
obsolete mark assumes that the process has already read the sinval
messages. So before searching the shared table, I make the process read
the sinval messages. After they are read, the local cache status becomes
consistent with the action of getting a new cache. This reading timing is
almost the same as current postgres behavior, because it happens after a
local cache miss both in the current design and in mine. After a cache
miss in the current design, a process opens the relation and takes a
heavyweight lock; at this time, in fact, it reads the sinval messages.
(These things are well summarized in a talk by Robert Haas at PGCon [1].)

Lastly, we need to invalidate a shared catctup itself at some point, but we
cannot delete it as long as someone sees it. So I'll introduce a
refcounter. It is increased and decreased at the same times as current
postgres manipulates the local refcounters of catctup and catclist, to
avoid a catctup being deleted while its catclist is in use, or vice versa
(that is, SearchCatCache/ReleaseCatCache). A shared catctup is thus deleted
when its shared refcount becomes zero and its obsolete flag is on. Once it
has vanished from the shared cache, the obsolete cache never comes back,
because a process which tries to get the cache but fails in the shared hash
table has already read the sinval messages.

RE: Copy data to DSA area

2019-06-24 Thread Ideriha, Takeshi
>From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
>Sent: Friday, April 26, 2019 11:50 PM
>To: 'Kyotaro HORIGUCHI' ;
>thomas.mu...@enterprisedb.com; robertmh...@gmail.com
>Cc: pgsql-hack...@postgresql.org
>Subject: RE: Copy data to DSA area
>
>Hi, I've updated Thomas's quick PoC.

Hi.

I've rebased the patch to fit into the core code rather than an extension.
Regarding the shared memory context (ShmContext), I added three APIs:
- CreatePermShmContext
  creates a "permanent" context located in shared memory
- CreateTempShmContext
  creates a "temporary" context located in local memory,
  which has a buffer to keep track of objects that could leak
- ChangeToPermShmContext
  changes allocated objects' parent to the permanent context

When you allocate something, add the object in the Temp ShmContext, and
re-parent it to the Perm ShmContext once it can no longer leak.
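The expected calling pattern would look roughly like this; the argument
lists below are my guesses (the real signatures are in the patch), and
MyEntry is just a stand-in type:

typedef struct MyEntry { int value; } MyEntry;

static MyEntry *
build_shared_entry(MemoryContext perm)
{
	MemoryContext temp = CreateTempShmContext(perm);	/* per build attempt */
	MemoryContext old = MemoryContextSwitchTo(temp);
	MyEntry    *entry;

	entry = palloc(sizeof(MyEntry));	/* tracked as a leak candidate; an
										 * error here frees it via temp */
	entry->value = 42;

	MemoryContextSwitchTo(old);
	ChangeToPermShmContext(temp, perm);	/* entry can no longer leak */
	return entry;
}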

The current method of keeping track of objects and freeing them at rollback
works well for the case where both the parent object and the child object
it points to are deleted, because no dangling pointer remains after
rollback. If a child object is freed but its parent was already allocated
in the permanent context, the parent is left with a dangling pointer. But
if an object is pointed to by an already-permanent object, the
just-allocated object won't be leaked, so in such cases we could skip the
temporary allocation and allocate it directly in the permanent context. On
rollback we could just leave it in shared memory and make the upper-layer
function handle the "ghost" object in a sensible way. I cannot see a clear
solution here. Do you have any thoughts?

MemoryContextMethods are not fully supported yet, but some methods like
stats() might be needed.

The current test case is very simple and the same as the previous one, but
implemented as an isolation test. It checks that an integer can be set in
shared memory and read by another process. The current test case doesn't
cover all possible cases, so more should be added.

I'll add this to the coming CF.

P.S.
Thomas, thank you for taking some time at PGCon to discuss this item and
the shared catalog cache. It was very helpful.
I'm going to submit an email about the shared catcache soon.

Regards,
Takeshi Ideriha


0001-Shared-memory-context-backed-by-DSA-and-its-test.patch
Description: 0001-Shared-memory-context-backed-by-DSA-and-its-test.patch


RE: Global shared meta cache

2019-05-22 Thread Ideriha, Takeshi
>From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
>[TL; DR]
>The basic idea is following 4 points:
>A. User can choose which database to put a cache (relation and catalog) on
>shared memory and how much memory is used
>B. Caches of committed data are on the shared memory. Caches of uncommitted
>data are on the local memory.
>C. Caches on the shared memory have xid information (xmin, xmax)
>D. Evict not recently used cache from shared memory

I have updated my thoughts about B and C for the CatCache.
I would be very happy to get some comments.

>[B & C]
>Regarding B & C, the motivation is that we don't want other backends to see
>uncommitted tables.
>The search order is local memory -> shared memory -> disk.
>A local process searches the cache in shared memory based on its own
>snapshot and the xid of the cache.
>When a cache entry is not found in shared memory, an entry with xmin is
>made in shared memory (but not in local memory).
>
>When a cache definition is changed by DDL, a new cache entry is created in
>local memory, and subsequent commands refer to the local cache if needed.
>When the change is committed, the local cache is cleared and the shared
>cache is updated. This update is done by adding xmax to the old entry and
>also making a new one with xmin. The idea behind adding a new one is that a
>newly created cache entry (new table or altered table) is likely to be used
>in the next transactions. At this point maybe we can make use of the
>current invalidation mechanism, even though no invalidation message is sent
>to other backends.

My current thoughts:
- Each catcache has a (maybe partial) HeapTupleHeader
- Put every catcache on shared memory, with no local catcache
- But a catcache for an aborted tuple is not put on shared memory
- A hash table exists per kind of CatCache
- These hash tables exist per database, plus one for shared catalogs
  - e.g.) there is a hash table for pg_class of a DB

Why I'm leaning toward not using a local cache:
- At commit time you need to copy the local cache to the global cache,
  which would delay the response time.
- Even if an uncommitted catcache entry is on shared memory, other
  transactions cannot see it. In my idea entries have xid information, and
  visibility is checked by comparing the xmin/xmax of the catcache entry
  with the snapshot.

OK, so if we put the catcache on shared memory, we need to check entry
visibility. But if we use the exact same visibility check mechanism as for
heap tuples, it takes many more steps than the current local catcache
search. The current visibility check is based on a snapshot check plus a
commit/abort check. So I'm thinking to store only in-progress or committed
caches. This saves the time for checking catcache status (commit/abort)
while searching the cache. Basically I'm going to use the current
visibility check mechanism except for the commit/abort check (in other
words, the check of clog).

This is how it works (a sketch follows below):
- When creating a catcache entry, copy the heap tuple with its HeapTupleHeader
- When an update/delete command for a catalog tuple finishes, update the
  xmax of the corresponding cache entry
- If there is a cache entry whose xmin is an aborted xid, delete the entry
- If there is a cache entry whose xmax is an aborted xid, reset its xmax
  information
- At commit time, no action on the shared cache is needed
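A sketch of the visibility test I have in mind, reusing an
XidInMVCCSnapshot-style check and assuming each shared entry embeds (part
of) a HeapTupleHeaderData as described above. Since only in-progress or
committed caches are stored, no clog lookup is needed:

static bool
shared_cache_visible(HeapTupleHeader hdr, Snapshot snapshot)
{
	TransactionId xmin = HeapTupleHeaderGetXmin(hdr);
	TransactionId xmax = HeapTupleHeaderGetRawXmax(hdr);

	/* creator still in progress as of our snapshot: not visible yet */
	if (XidInMVCCSnapshot(xmin, snapshot))
		return false;

	/* no deleter, or the deleter is still in progress for us: visible */
	return !TransactionIdIsValid(xmax) || XidInMVCCSnapshot(xmax, snapshot);
}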

Pending items are
- thoughts about shared relcache
- "vacuum" process for shared cache

Regards,
Ideriha Takeshi





RE: Copy data to DSA area

2019-05-14 Thread Ideriha, Takeshi
Hi, 

>>From: Thomas Munro [mailto:thomas.mu...@gmail.com]
>>Subject: Re: Copy data to DSA area
>>
>>On Wed, May 8, 2019 at 5:29 PM Ideriha, Takeshi
>>
>>wrote:
>>> >From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
>>> >Sent: Friday, April 26, 2019 11:50 PM Well, after developing PoC, I
>>> >realized that this PoC doesn't solve the local process is crashed
>>> >before the context becomes shared because local process keeps track
>>> >of pointer to chunks.
>>> >Maybe all of you have already noticed and pointed out this case :)
>>> >So it needs another work but this poc is a good step for me to advance 
>>> >more.
>>>
>>> I think the point to prevent memory leak is allocating memory and
>>> storing its address into a structure at the same time. This structure
>>> should be trackable from other process.
>>
>>I'm not sure that it's necessarily wrong to keep tracking information in 
>>private
>memory.
>>If any backend crashes, the postmaster will terminate all backends and
>>reinitialise everything anyway, because shared memory might be corrupted.
>
>I'm going to keep the tracking information in private memory and send a
>patch.

I updated the PoC patch. It has a memory tracking buffer in the local
process. The old version also had this mechanism, but I refactored the
code. Memory leaks while allocating memory seem to be solved thanks to the
tracking buffer.

What I haven't addressed is memory leaks while freeing objects. In the
current sequence a cache (e.g. relcache) is freed after being removed from
its hash table. If the cache and hash table become shared, a memory leak is
likely to happen between the removal from the hash table and the free: we
lose track of the cache objects if an error happens after the cache is
unlinked from the hash table. Also, a cache consists of a graph structure,
so we must take care not to free a cache only partially.

Maybe we need to remember the pointers of the objects before unlinking them
from the hash table, and free them all at once after we make sure that all
the pointers are registered in the local buffer. Here is an idea for
implementing this (a sketch follows the list):
- change the order of removal from the hash table and deletion
- pfree() in the shared memory context doesn't dsa_free() but just adds the
  pointer to the local buffer
- remove the link from the hash table after all pfree() calls are done
- then call a function which does the actual dsa_free() by looking at the
  local buffer
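As a sketch, the deferred-free idea could look like this (every name here
is a placeholder, and overflow handling and locking are ignored):

#define MAX_PENDING 128			/* arbitrary */

static dsa_pointer pending_free[MAX_PENDING];	/* local buffer */
static int	npending = 0;

/* pfree() on a shared cache object only records the pointer */
static void
shm_deferred_pfree(dsa_pointer dp)
{
	pending_free[npending++] = dp;
}

/* called once the entry is fully unlinked from the hash table */
static void
shm_flush_deferred_frees(dsa_area *area)
{
	int			i;

	for (i = 0; i < npending; i++)
		dsa_free(area, pending_free[i]);
	npending = 0;
}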

But I'm not sure this solution is a good one. Do you have any thoughts?

Regards,
Takeshi Ideriha
/*
 * Another PoC of MemoryContext for DSA based on Thomas Munro's work
 *
 * https://www.postgresql.org/message-id/
 * CAEepm%3D22H0TDCAHWh%3DYWmSV%2BX%2BbTtcTNg8RpP%3DeaCjWJU_d-9A
 * %40mail.gmail.com
 *
 */

#include "postgres.h"

#include "fmgr.h"
#include "lib/ilist.h"
#include "miscadmin.h"
#include "nodes/pg_list.h"
#include "nodes/memnodes.h"
#include "storage/ipc.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"
#include "utils/dsa.h"
#include "utils/memutils.h"

#define MY_AREA_SIZE (1024 * 1024)
#define NUM_CHUNKS 64 /* an arbitrary number */


PG_MODULE_MAGIC;

void _PG_init(void);
PG_FUNCTION_INFO_V1(hoge);
PG_FUNCTION_INFO_V1(hoge_list);
PG_FUNCTION_INFO_V1(hoge_list_error);

#define isLocalContext(c) (c->alloc_buffer != NULL)
#define isBufferFull(buf) (buf->tail == NUM_CHUNKS)

/* Support code to make a dsa_area into a MemoryContext. */
typedef struct dsa_temp_buffer dsa_temp_buffer;

static void hoge_shmem_startup_hook(void);
static MemoryContext make_hoge_memory_context(dsa_area *area, void *base, bool isLocal);
static void push_new_buffer(dsa_temp_buffer *buffer);

static shmem_startup_hook_type prev_shmem_startup_hook;
static void *my_raw_memory;
static dsa_area *my_area;
static MemoryContext my_shared_dsa_context;
static MemoryContext my_local_dsa_context;

static List **my_list;
void AddToDsaSharedContext(MemoryContext local, MemoryContext shared);
MemoryContext CreateDsaLocalContext(MemoryContext shared);
Size ShmDsaContextSize(void);


void
_PG_init(void)
{
/* This only works if preloaded by the postmaster. */
if (!process_shared_preload_libraries_in_progress)
return;

/* Request a chunk of traditional shared memory. */
RequestAddinShmemSpace(MY_AREA_SIZE);

/* Register our hook for phase II of initialization. */
prev_shmem_startup_hook = shmem_startup_hook;
shmem_startup_hook = hoge_shmem_startup_hook;
}

static void
hoge_shmem_startup_hook(void)
{
MemoryContext   old_context;
boolfound;

if (prev_shmem_startup_hook)
prev_shmem_startup_hook();

old_context = MemoryContextSwitchTo(TopMemoryContext);

   

RE: Copy data to DSA area

2019-05-08 Thread Ideriha, Takeshi
Hi, Thomas 

>-Original Message-
>From: Thomas Munro [mailto:thomas.mu...@gmail.com]
>Subject: Re: Copy data to DSA area
>
>On Wed, May 8, 2019 at 5:29 PM Ideriha, Takeshi 
>
>wrote:
>> >From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
>> >Sent: Friday, April 26, 2019 11:50 PM Well, after developing PoC, I
>> >realized that this PoC doesn't solve the local process is crashed
>> >before the context becomes shared because local process keeps track
>> >of pointer to chunks.
>> >Maybe all of you have already noticed and pointed out this case :) So
>> >it needs another work but this poc is a good step for me to advance more.
>>
>> I think the point to prevent memory leak is allocating memory and
>> storing its address into a structure at the same time. This structure
>> should be trackable from other process.
>
>I'm not sure that it's necessarily wrong to keep tracking information in 
>private memory.
>If any backend crashes, the postmaster will terminate all backends and 
>reinitialise
>everything anyway, because shared memory might be corrupted.

Thank you very much for the clarification. I hadn't looked into the
reinitialization sequence much. I checked that
CreateSharedMemoryAndSemaphores is called again and shared memory gets
initialized. So I'm going to keep the tracking information in private
memory and send a patch.

Regards,
Takeshi Ideriha



RE: Copy data to DSA area

2019-05-07 Thread Ideriha, Takeshi
Hi, 

>From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
>Sent: Friday, April 26, 2019 11:50 PM
>Well, after developing PoC, I realized that this PoC doesn't solve the local 
>process is
>crashed before the context becomes shared because local process keeps track of
>pointer to chunks.
>Maybe all of you have already noticed and pointed out this case :) So it needs 
>another
>work but this poc is a good step for me to advance more.

I think the point of preventing memory leaks is to allocate memory and
store its address into a structure at the same time. This structure should
be trackable from other processes.

So I'm going to change the design. I'll allocate a buffer of pointers to
DSA objects which are not permanent yet. This buffer is located in shared
memory. For simplicity, the buffer is allocated per process; that is, the
number of buffers equals MaxBackends.

A developer calls an API to make objects permanent. If objects become
permanent, the corresponding pointers are deleted from the buffer. If they
fail to become permanent, they are freed from shared memory by checking the
buffer, and the corresponding pointers are also deleted from it.

There are two cases of failure: one is transaction abort, the other is a
process crash. If a transaction aborts, the process itself has
responsibility for freeing the objects, and it frees them. In case of a
process crash, another backend needs to take care of it. I'm assuming the
on_shmem_exit callback can be used to free the objects.
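A sketch of the cleanup side under those assumptions (names are mine; the
real patch may differ):

#define TRACK_CHUNKS 64			/* arbitrary */

typedef struct LeakTrackBuffer
{
	int			nchunks;
	dsa_pointer chunks[TRACK_CHUNKS];	/* not-yet-permanent objects */
} LeakTrackBuffer;

static LeakTrackBuffer *my_track_buf;	/* this backend's slot in a
										 * MaxBackends-sized shared array */
static dsa_area *my_track_area;

static void
free_tracked_chunks(int code, Datum arg)
{
	int			i;

	for (i = 0; i < my_track_buf->nchunks; i++)
		dsa_free(my_track_area, my_track_buf->chunks[i]);
	my_track_buf->nchunks = 0;
}

/* registered once at backend startup */
static void
register_leak_cleanup(void)
{
	on_shmem_exit(free_tracked_chunks, (Datum) 0);
}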

Regarding the MemoryContext, one shared MemoryContext is placed in shared
memory, and each chunk has a back pointer to it for pfree() and so on.

I'm going to make a new patch and send it later.

Regards,
Takeshi Ideriha






RE: Copy data to DSA area

2019-04-26 Thread Ideriha, Takeshi
Hi, I've updated Thomas's quick PoC.

>From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
>Sent: Wednesday, April 17, 2019 2:07 PM
>>From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
>>Sent: Wednesday, December 5, 2018 2:42 PM
>>Subject: RE: Copy data to DSA area
>
>Things under discussion:
>- How we prevent memory leaks
>- How we prevent dangling pointers after cleaning up about-to-leak objects
>
>Regarding memory leaks, I think Robert's idea of allocating objects under a
>temporary context while building them and re-parenting to a permanent one
>at some point is promising. While being built, objects are under a
>temporary DSA MemoryContext, which is a child of TopTransactionContext (if
>in a transaction), and they are freed all at once when an error happens.
>To delete all the chunks allocated under the temporary DSA context, we need
>to search for or remember all chunk locations under the context. Unlike
>AllocSet we don't have block information to delete them altogether.
>
>So I'm thinking to manage dsa_allocated chunks with a singly linked list to
>keep track of them and delete them.
>The context has the head of the linked list, and every chunk has a pointer
>to the next allocated chunk. But this adds space overhead to every
>dsa_allocated chunk, and we may want to avoid that because shared memory is
>limited. In that case, we could free the pointer space at the point when we
>make sure the allocation is successful.
>
>Another question is when we should consider an allocation successful (when
>we make sure the object won't leak).
>If the allocation is done in a transaction, we can consider it safe once
>the transaction commits.
>Alternatively, I assume this DSA memory context is for caches such as
>relcache, catcache, plancache and so on. In that case a cache won't leak
>once it's added to a hash table or list, because I assume some eviction
>mechanism like LRU will be implemented and will erase useless caches some
>time later.

I changed the design from my last email.
I've introduced "local" and "shared" MemoryContexts for DSA.
Local means the MemoryContext object is allocated on the local heap, and
shared means on shared memory.
While dsa_allocating, operations should be done on the local context; after
the developer judges the objects won't leak, the local context is set to
the shared one.

dsa_pointers to chunks are stored in an array linked from the local
context. When an error happens before the chunks become "shared", all
chunks are freed by checking the array. On the other hand, once a chunk
gets "committed" and becomes shared, we no longer need that array of
pointers.

This PoC has three test cases, which are updated versions of Thomas's:
- hoge(): palloc(), pfree(), then set the context shared and do it again
- hoge_list(): add a chunk to a shared singly linked list and set the
  context to shared
- hoge_list_error(): add a chunk to the linked list, but an error happens
  before it becomes shared, so free the chunk to prevent a memory leak

Well, after developing the PoC, I realized that it doesn't handle the case
where the local process crashes before the context becomes shared, because
the local process keeps track of the pointers to the chunks.
Maybe all of you have already noticed and pointed out this case :)
So it needs more work, but this PoC is a good step for me to advance
further.

Another thing is that I didn't want to put a back pointer to the
MemoryContext before every chunk, since it has some overhead in limited
shared memory. But pfree() uses it, so I compromised. Also, when setting a
local context to a shared one, I need to change every back pointer to the
shared MemoryContext, which has some cost.

I think there is a more efficient way.
(Maybe Robert mentioned it in a previous email?)
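To make the current compromise concrete: the back pointer works like
aset.c's chunk header, roughly as in this sketch (names are mine):

typedef struct ShmChunkHeader
{
	MemoryContext context;		/* owning ShmContext (local or shared) */
} ShmChunkHeader;

static void
shm_pfree(void *ptr)
{
	ShmChunkHeader *hdr = ((ShmChunkHeader *) ptr) - 1;

	/* dispatch to whichever context currently owns the chunk */
	hdr->context->methods->free_p(hdr->context, ptr);
}

Re-parenting then has to rewrite the header of every chunk, which is the
O(n) cost mentioned above.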

>Regarding dangling pointer I think it's also problem.
>After cleaning up objects to prevent memory leak we don't have mechanism to
>reset dangling pointer.

I haven't addressed the dangling pointer yet.
Actually, a hoge_list() issued after hoge_list_error() crashes the backend.
That seems to be because the allocated chunk is freed after the error, but
the tail pointer of the shared linked list is not recovered; it becomes a
dangling pointer. So this is a good example of the dangling pointer
problem.

Regards,
Takeshi Ideriha


PoC-DSA-MemoryContext-extension.tar.gz
Description: PoC-DSA-MemoryContext-extension.tar.gz


RE: Global shared meta cache

2019-04-19 Thread Ideriha, Takeshi
>From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
>Do you have any thoughts?
>
Hi, I updated my idea, hoping to get some feedback.

[TL; DR]
The basic idea is following 4 points:
A. User can choose which database to put a cache (relation and catalog) on 
shared memory and how much memory is used
B. Caches of committed data are on the shared memory. Caches of uncommitted 
data are on the local memory.
C. Caches on the shared memory have xid information (xmin, xmax)
D. Evict not recently used cache from shared memory


[A]
Regarding point A, I can imagine that some databases are connected to by
lots of clients while others aren't. So I introduced a new parameter in
postgresql.conf, "shared_meta_cache", which is disabled by default and
needs a server restart to enable.
ex. shared_meta_cache = 'db1:500MB, db2:100MB'.

Some catcaches like pg_database are shared among all databases, so such
shared catcaches are allocated in a dedicated space within shared memory.
This space can be controlled by the "shared_meta_global_catcache"
parameter, which is named after the global directory. But I want this
parameter to be hidden in postgresql.conf to keep things simple for users;
it's too detailed.

[B & C]
Regarding B & C, the motivation is that we don't want other backends to see
uncommitted tables.
The search order is local memory -> shared memory -> disk.
A local process searches the cache in shared memory based on its own
snapshot and the xid of the cache.
When a cache entry is not found in shared memory, an entry with xmin is
made in shared memory (but not in local memory).

When a cache definition is changed by DDL, a new cache entry is created in
local memory, and subsequent commands refer to the local cache if needed.
When the change is committed, the local cache is cleared and the shared
cache is updated. This update is done by adding xmax to the old entry and
also making a new one with xmin. The idea behind adding a new one is that a
newly created cache entry (new table or altered table) is likely to be used
in the next transactions. At this point maybe we can make use of the
current invalidation mechanism, even though no invalidation message is sent
to other backends.

[D]
As for D, I'm thinking to benchmark with simple LRU. If the performance is
bad, I'll change to another algorithm like Clock.
We don't care about eviction of the local cache because its lifetime is one
transaction, and I don't want to let it bloat.

best regards,
Takeshi Ideriha





RE: Copy data to DSA area

2019-04-16 Thread Ideriha, Takeshi
>From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
>Sent: Wednesday, December 5, 2018 2:42 PM
>Subject: RE: Copy data to DSA area

Hi
It's been a long while since we discussed this topic.
Let me recap first and then give some thoughts.

Things we got consensus on:
- Want to palloc/pfree transparently in DSA
- Use Postgres-initialized shared memory as DSA
- Don't leak memory in shared memory

Things under discussion:
- How we prevent memory leaks
- How we prevent dangling pointers after cleaning up about-to-leak objects

Regarding memory leaks, I think Robert's idea of allocating objects under a
temporary context while building them and re-parenting to a permanent one
at some point is promising. While being built, objects are under a
temporary DSA MemoryContext, which is a child of TopTransactionContext (if
in a transaction), and they are freed all at once when an error happens.
To delete all the chunks allocated under the temporary DSA context, we need
to search for or remember all chunk locations under the context. Unlike
AllocSet we don't have block information to delete them altogether.

So I'm thinking to manage dsa_allocated chunks with a singly linked list to
keep track of them and delete them.
The context has the head of the linked list, and every chunk has a pointer
to the next allocated chunk. But this adds space overhead to every
dsa_allocated chunk, and we may want to avoid that because shared memory is
limited. In that case, we could free the pointer space at the point when we
make sure the allocation is successful.
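In code, the linked-list idea would be something like the following sketch
(names are mine):

/* Every dsa_allocated chunk is prefixed with a link so the temporary
 * context can find and free all of them on error. */
typedef struct ShmChunkLink
{
	dsa_pointer next;			/* next chunk allocated under this context */
} ShmChunkLink;

static void *
tracked_dsa_alloc(dsa_area *area, dsa_pointer *head, Size size)
{
	dsa_pointer dp = dsa_allocate(area, sizeof(ShmChunkLink) + size);
	ShmChunkLink *link = (ShmChunkLink *) dsa_get_address(area, dp);

	link->next = *head;			/* push onto the context's list */
	*head = dp;
	return link + 1;
}

static void
free_all_tracked(dsa_area *area, dsa_pointer *head)
{
	while (DsaPointerIsValid(*head))
	{
		ShmChunkLink *link = (ShmChunkLink *) dsa_get_address(area, *head);
		dsa_pointer next = link->next;

		dsa_free(area, *head);
		*head = next;
	}
}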

Another question is when we should consider an allocation successful (when
we make sure the object won't leak).
If the allocation is done in a transaction, we can consider it safe once
the transaction commits.
Alternatively, I assume this DSA memory context is for caches such as
relcache, catcache, plancache and so on. In that case a cache won't leak
once it's added to a hash table or list, because I assume some eviction
mechanism like LRU will be implemented and will erase useless caches some
time later.

What do you think about these ideas?

Regarding dangling pointers, I think they are also a problem.
After cleaning up objects to prevent memory leaks, we have no mechanism to
reset the dangling pointers.
I gave some thoughts on this a while ago, though begin_allocate/end_allocate
don't seem like good names. More self-explanatory names might be
start_pointing_to_dsa_object_under_construction() and
end_pointing_to_dsa_object_under_construction().
https://www.postgresql.org/message-id/4E72940DA2BF16479384A86D54D0988A6F1F259F%40G01JPEXMBKW04
If we can make sure such dangling pointers never happen, we don't need
them. As Thomas mentioned before, where these interfaces should be put
needs review, but I couldn't come up with another solution right now.

Do you have some thoughts?

best regards,
Ideriha, Takeshi





RE: Protect syscache from bloating with negative cache entries

2019-04-11 Thread Ideriha, Takeshi
>From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
>Does this result show that the hard-limit size option with memory
>accounting doesn't harm the usual users who leave the hard limit disabled?

Hi, 

I've implemented a relation cache size limitation with an LRU list and the
built-in memory context size accounting, and I'll share some results
coupled with a quick recap of the catcache, so that we can resume
discussion if needed. Relation cache bloat was also discussed in this
thread, but right now it's pending, and the catcache feature is not fixed;
still, a variety of information should be useful, I believe.

Regarding the catcache, recent posts by Horiguchi-san show pretty detailed
stats, including a comparison of LRU overhead against a full scan of the
hash table. According to the results the LRU overhead seems small, but for
simplicity this thread goes on without LRU.
https://www.postgresql.org/message-id/20190404.215255.09756748.horiguchi.kyotaro%40lab.ntt.co.jp

When there was a hard limit on the catcache, there was a built-in memory
context size accounting machinery. I checked the overhead of that
accounting: when repeating palloc and pfree of an 800-byte area many times
it was 4% slower, while for a 32768-byte area there seems to be no
overhead.
https://www.postgresql.org/message-id/4E72940DA2BF16479384A86D54D0988A6F44564E%40G01JPEXMBKW04
 


Regarding the relcache hard limit (relation_cache_max_size), most of the
architecture is similar to the catcache one with an LRU list, except for
memory accounting. Relcaches are managed by an LRU list. To prune the LRU
cache, we need to know the overall relcache size, including objects pointed
to by relcaches, like 'index info'.
So in this patch relcache objects are allocated under
RelCacheMemoryContext, which is a child of CacheMemoryContext, and objects
pointed to by relcaches are allocated under child contexts of
RelCacheMemoryContext.
In the built-in size accounting, if a MemoryContext is set to collect the
"group (family) size", you can easily get the context size including its
children.

I ran two experiments:
A) One is pgbench using Tomas's script posted a while ago, which randomly
does "select 1" from many tables.
https://www.postgresql.org/message-id/4E72940DA2BF16479384A86D54D0988A6F426207%40G01JPEXMBKW04

B) The other checks the memory context accounting overhead using the same
method as before.
https://www.postgresql.org/message-id/4E72940DA2BF16479384A86D54D0988A6F44564E%40G01JPEXMBKW04

A) randomly select 1 from many tables
Results are the average of 5 runs each.

number of tables                | 100         |1000        |10000
------------------------------------------------------------------
TPS (master)                    | 11105       |10815       |8915
TPS (patch; limit feature off)  | 11254 (+1%) |11176 (+3%) |9242 (+4%)
TPS (patch; limit on with 1MB)  | 11317 (+2%) |10491 (-3%) |7380 (-17%)

The results are noisy, but it seems the overhead of LRU and memory
accounting is small when the relcache limit feature is turned off.
With the limit feature on, TPS drops 17% after exceeding the limit, which
is no surprise.


B) Repeat palloc/pfree
"With group accounting" means the test context and its child contexts are
accounted with the built-in accounting, using palloc_bench_family(); the
other rows use palloc_bench(). Please see palloc_bench.gz.

[Size=32768, iter=1,000,000]
Master                    | 59.97 ms
Master with group account | 59.57 ms
Patched                   | 67.23 ms
Patched with family       | 68.81 ms

The overhead seems large with this patch, so this area needs more
inspection.


regards,
Takeshi Ideriha




palloc_bench.gz
Description: palloc_bench.gz


v15-0001-4-Add-dlist_move_tail.patch
Description: v15-0001-4-Add-dlist_move_tail.patch


binj9L3vPCahv.bin
Description: v15-0002-4-ideriha-Memory-consumption-report-reature-of-memorycontext.patch


v15-0004-4-ideriha-Remove-RelCache-Entries.patch
Description: v15-0004-4-ideriha-Remove-RelCache-Entries.patch


v15-0003-4-ideriha-Remove-CatCache-Entries.patch
Description: v15-0003-4-ideriha-Remove-CatCache-Entries.patch


RE: idle-in-transaction timeout error does not give a hint

2019-03-31 Thread Ideriha, Takeshi
>From: Tatsuo Ishii [mailto:is...@sraoss.co.jp]
>
>> Personally, I don't find this hint particularly necessary.  The
>> session was terminated because nothing was happening, so the real fix
>> on the application side is probably more involved than just retrying.
>> This is different from some of the other cases that were cited, such
>> as serialization conflicts, where you just got unlucky due to
>> concurrent events.  In the case of idle-in-transaction-timeout, the
>> fault is entirely your own.
>
>Hum. Sounds like a fair argument. Ideriha-san, what do you think?
>

Hi.
As Peter mentioned, application code generally needs to handle more than
just retrying, so I'm OK with not adding this hint.

Regards,
Takeshi Ideriha






RE: Protect syscache from bloating with negative cache entries

2019-03-24 Thread Ideriha, Takeshi

>From: Vladimir Sitnikov [mailto:sitnikov.vladi...@gmail.com]
>
>Robert> This email thread is really short on clear demonstrations that X
>Robert> or Y is useful.
>
>It is useful when the whole database does **not** crash, isn't it?
>
>Case A (==current PostgeSQL mode): syscache grows, then OOMkiller chimes in, 
>kills
>the database process, and it leads to the complete cluster failure (all other 
>PG
>processes terminate themselves).
>
>Case B (==limit syscache by 10MiB or whatever as Tsunakawa, Takayuki
>asks):  a single ill-behaved process works a bit slower and/or consumers more 
>CPU
>than the other ones. The whole DB is still alive.
>
>I'm quite sure "case B" is much better for the end users and for the database
>administrators.
>
>So, +1 to Tsunakawa, Takayuki, it would be so great if there was a way to 
>limit the
>memory consumption of a single process (e.g. syscache, workmem, etc, etc).
>
>Robert> However, memory usage is quite unpredictable.  It depends on how
>Robert> many backends are active
>
>The number of backends can be limited by ensuring a proper limits at 
>application
>connection pool level and/or pgbouncer and/or things like that.
>
>Robert>how many copies of work_mem and/or  maintenance_work_mem are in
>Robert>use
>
>There might be other patches to cap the total use of
>work_mem/maintenance_work_mem,
>
>Robert>I don't think we
>Robert> can say that just imposing a limit on the size of the system
>Robert>caches is  going to be enough to reliably prevent an out of
>Robert>memory condition
>
>The less possibilities there are for OOM the better. Quite often it is much 
>better to fail
>a single SQL rather than kill all the DB processes.

Yeah, I agree. This limit would be useful for such extreme situations.

Regards,
Takeshi Ideriha


RE: Protect syscache from bloating with negative cache entries

2019-03-07 Thread Ideriha, Takeshi
>From: Robert Haas [mailto:robertmh...@gmail.com]
>On Thu, Mar 7, 2019 at 9:49 AM Tomas Vondra 
>wrote:
>> I don't think this shows any regression, but perhaps we should do a
>> microbenchmark isolating the syscache entirely?
>
>Well, if we need the LRU list, then yeah I think a microbenchmark would be a 
>good idea
>to make sure we really understand what the impact of that is going to be.  But 
>if we
>don't need it and can just remove it then we don't.

Just to be sure: we introduced the LRU list in this thread to find the
entries older than the threshold time without scanning the whole hash
table. If the hash table becomes large, scanning it without an LRU list
becomes slow.
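For context, the LRU bookkeeping is the usual doubly-linked-list trick,
roughly like this sketch (it uses the proposed dlist_move_tail; lru_node,
total_cache_size and remove_cache_entry are placeholders):

static dlist_head cc_lru_list;
static Size total_cache_size;

static void
catcache_touch(CatCTup *ct)
{
	/* most recently used entries live at the tail */
	dlist_move_tail(&cc_lru_list, &ct->lru_node);
}

static void
catcache_prune(Size limit)
{
	while (total_cache_size > limit && !dlist_is_empty(&cc_lru_list))
	{
		/* oldest entry sits at the head, so no full-table scan is needed */
		CatCTup    *victim = dlist_container(CatCTup, lru_node,
											 dlist_head_node(&cc_lru_list));

		remove_cache_entry(victim);	/* placeholder: unlink and free */
	}
}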

Regards,
Takeshi Ideriha


RE: Protect syscache from bloating with negative cache entries

2019-03-03 Thread Ideriha, Takeshi
>From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com]

>> [Size=800, iter=1,000,000]
>> Master |15.763
>> Patched|16.262 (+3%)
>>
>> [Size=32768, iter=1,000,000]
>> Master |61.3076
>> Patched|62.9566 (+2%)
>
>What's the unit, second or millisecond?
Millisecond.

>Why is the number of digits to the right of the decimal point different?
>
>Is the measurement correct?  I'm wondering because the difference is larger
>in the latter case.  Isn't the accounting processing almost the same in
>both cases?
>* former: 16.262 - 15.763 = 0.499
>* latter: 62.956 - 61.307 = 1.649
>I think the overhead is sufficiently small.  It may get even smaller with a
>trivial tweak.
>
>You added the new member usedspace at the end of MemoryContextData.  The
>original size of MemoryContextData is 72 bytes, and Intel Xeon's cache line
>is 64 bytes.  So, the new member will be on a separate cache line.  Try
>putting usedspace before the name member.

OK. I changed the order of the MemoryContextData members to fit usedspace
into one cache line. I disabled all the catcache eviction mechanisms in the
patched build and compared it with master, to verify that the overhead of
memory accounting is small enough.
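For reference, the cache-line point can be checked with a standalone toy
(the struct below is a stand-in, not the real MemoryContextData):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct MemoryContextDataSketch
{
	int			type;			/* NodeTag in the real struct */
	bool		isReset;
	bool		allowInCritSection;
	size_t		usedspace;		/* moved up among the hot members */
	void	   *methods;
	void	   *parent;
	void	   *firstchild;
	const char *name;
} MemoryContextDataSketch;

int
main(void)
{
	/* anything below offset 64 shares the first cache line on Xeon */
	printf("usedspace offset: %zu\n",
		   offsetof(MemoryContextDataSketch, usedspace));
	return 0;
}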

The settings are almost the same as in the last email, but last time the
number of trials was 50, so I increased it and ran 5000 trials to calculate
the average (rounded to two decimal places).

[Size=800, iter=1,000,000]
 Master |15.64 ms
 Patched|16.26 ms (+4%)
 The difference is 0.62 ms.

[Size=32768, iter=1,000,000]
 Master |61.39 ms
 Patched|60.99 ms (-1%)

I guess there is around 2% noise.
But based on this experiment, the overhead seems small. There is still some
overhead, but it can be hidden under other processing like malloc().

Does this result show that the hard-limit size option with memory
accounting doesn't harm the usual users who leave the hard limit disabled?

Regards,
Takeshi Ideriha


v15-0001-3-Add-dlist_move_tail.patch
Description: v15-0001-3-Add-dlist_move_tail.patch


binuIZWS0c0ho.bin
Description: v15-0002-3-ideriha-Memory-consumption-report-reature-of-memorycontext.patch


v15-0003-3-ideriha-Remove-CatCache-Entries.patch
Description: v15-0003-3-ideriha-Remove-CatCache-Entries.patch


RE: Protect syscache from bloating with negative cache entries

2019-02-28 Thread Ideriha, Takeshi
>From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com]
>From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
>> I measured the memory context accounting overhead using Tomas's tool
>> palloc_bench, which he made a while ago in a similar discussion.
>> https://www.postgresql.org/message-id/53f7e83c.3020...@fuzzy.cz
>>
>> This tool is a little outdated, so I fixed it, but basically I followed
>> him. Things I did:
>> - make one MemoryContext
>> - run both palloc() and pfree() of a 32kB area 1,000,000 times
>> - measure this time

>And are you sure you didn't enable assert checking?
Ah, sorry.. I misconfigured it. 

>I'm afraid the measurement is not correct.  First, the older discussion below 
>shows
>that the accounting overhead is much, much smaller, even with a more complex
>accounting.
>Second, allocation/free of memory > 8 KB calls malloc()/free().  I guess the
>accounting overhead will be more likely to be hidden under the overhead of 
>malloc()
>and free().  What we'd like to know the overhead when malloc() and free() are 
>not
>called.

Here is the average of 50 measurements: palloc/pfree of an 800-byte area
1,000,000 times, and of a 32kB area 1,000,000 times.
I checked with gdb that malloc is not called at size=800.

[Size=800, iter=1,000,000]
Master |15.763
Patched|16.262 (+3%)

[Size=32768, iter=1,000,000]
Master |61.3076
Patched|62.9566 (+2%)

At least compared to the previous HashAgg version, the overhead is smaller.
There is some overhead, but the increase is only 2 or 3%.

Regards,
Takeshi Ideriha


RE: Protect syscache from bloating with negative cache entries

2019-02-27 Thread Ideriha, Takeshi
>From: Robert Haas [mailto:robertmh...@gmail.com]
>
>On Mon, Feb 25, 2019 at 3:50 AM Tsunakawa, Takayuki
> wrote:
>> How can I make sure that this context won't exceed, say, 10 MB to avoid OOM?
>
>As Tom has said before and will probably say again, I don't think you actually 
>want that.
>We know that PostgreSQL gets roughly 100x slower with the system caches 
>disabled
>- try running with CLOBBER_CACHE_ALWAYS.  If you are accessing the same system
>cache entries repeatedly in a loop - which is not at all an unlikely scenario, 
>just run the
>same query or sequence of queries in a loop - and if the number of entries 
>exceeds
>10MB even, perhaps especially, by just a tiny bit, you are going to see a 
>massive
>performance hit.
>Maybe it won't be 100x because some more-commonly-used entries will always stay
>cached, but it's going to be really big, I think.
>
>Now you could say - well it's still better than running out of memory.
>However, memory usage is quite unpredictable.  It depends on how many backends
>are active and how many copies of work_mem and/or maintenance_work_mem are in
>use, among other things.  I don't think we can say that just imposing a limit 
>on the
>size of the system caches is going to be enough to reliably prevent an out of 
>memory
>condition unless the other use of memory on the machine happens to be extremely
>stable.

>So I think what's going to happen if you try to impose a hard-limit on the 
>size of the
>system cache is that you will cause some workloads to slow down by 3x or more
>without actually preventing out of memory conditions.  What you need to do is 
>accept
>that system caches need to grow as big as they need to grow, and if that 
>causes you
>to run out of memory, either buy more memory or reduce the number of concurrent
>sessions you allow.  It would be fine to instead limit the cache memory if 
>those cache
>entries only had a mild effect on performance, but I don't think that's the 
>case.


I'm afraid I may be quibbling about it.
What about users who understand the performance drop but don't want to add
memory or decrease concurrency? PostgreSQL has parameters which most users
don't mind and use at their defaults, but which a few users want to change.
In this case, as you said, introducing a hard-limit parameter can decrease
performance significantly, so how about adding a detailed caution to the
documentation, like for the planner cost parameters?

Regards,
Takeshi Ideriha


RE: Protect syscache from bloating with negative cache entries

2019-02-26 Thread Ideriha, Takeshi
>From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
>>>* 0001: add dlist_push_tail() ... as is
>>>* 0002: memory accounting, with correction based on feedback
>>>* 0003: merge the original 0003 and 0005, with correction based on
>>>feedback
>>
>>Attached are simpler version based on Horiguchi san's ver15 patch,
>>which means cache is pruned by both time and size.
>>(Still cleanup function is complex but it gets much simpler.)
>
>I don't mean to disregard what Horiguchi-san and others have developed and
>discussed, but I refactored the v15 patch again to reduce its complexity,
>because it seems to me one of the reasons for dropping the prune-by-size
>feature stems from code complexity.
>
>Another thing is that memory accounting overhead has been discussed, but
>its effect hasn't been measured in this thread. So I'd like to measure it.

I measured the memory context accounting overhead using Tomas's tool
palloc_bench, which he made a while ago in a similar discussion.
https://www.postgresql.org/message-id/53f7e83c.3020...@fuzzy.cz

This tool is a little outdated, so I fixed it, but basically I followed
him. Things I did:
- make one MemoryContext
- run both palloc() and pfree() of a 32kB area 1,000,000 times
- measure this time

The result shows that master is 30 times faster than the patched one.
So, as Andres mentioned upthread, it seems to have overhead.

[master (without v15 patch)]
61.52 ms
60.96 ms
61.40 ms
61.42 ms
61.14 ms

[with v15 patch]
1838.02 ms
1754.84 ms
1755.83 ms
1789.69 ms
1789.44 ms

Regards,
Takeshi Ideriha


RE: Protect syscache from bloating with negative cache entries

2019-02-25 Thread Ideriha, Takeshi
>>From: Tsunakawa, Takayuki
>>Ideriha-san,
>>Could you try simplifying the v15 patch set to see how simple the code
>>would look or not?  That is:
>>
>>* 0001: add dlist_push_tail() ... as is
>>* 0002: memory accounting, with correction based on feedback
>>* 0003: merge the original 0003 and 0005, with correction based on
>>feedback
>
>Attached are simpler version based on Horiguchi san's ver15 patch, which means
>cache is pruned by both time and size.
>(Still cleanup function is complex but it gets much simpler.)

I don't mean to disregard what Horiguchi-san and others have developed and
discussed, but I refactored the v15 patch again to reduce its complexity,
because it seems to me one of the reasons for dropping the prune-by-size
feature stems from code complexity.

Another thing is that memory accounting overhead has been discussed, but
its effect hasn't been measured in this thread. So I'd like to measure it.

Regards,
Takeshi Ideriha


v15-0001-2-Add-dlist_move_tail.patch
Description: v15-0001-2-Add-dlist_move_tail.patch


bin7aPjzkMJvE.bin
Description: v15-0002-2-ideriha-Memory-consumption-report-reature-of-memorycontext.patch


v15-0003-2-ideriha-Remove-CatCache-Entries.patch
Description: v15-0003-2-ideriha-Remove-CatCache-Entries.patch


RE: Protect syscache from bloating with negative cache entries

2019-02-22 Thread Ideriha, Takeshi
>From: Tsunakawa, Takayuki
>Ideriha-san,
>Could you try simplifying the v15 patch set to see how simple the code would 
>look or
>not?  That is:
>
>* 0001: add dlist_push_tail() ... as is
>* 0002: memory accounting, with correction based on feedback
>* 0003: merge the original 0003 and 0005, with correction based on feedback

Attached is a simpler version based on Horiguchi-san's v15 patch, which
means the cache is pruned by both time and size.
(The cleanup function is still complex, but it gets much simpler.)

Regards,
Takeshi Ideriha


v15-0003-ideriha-Remove-CatCache-Entries.patch
Description: v15-0003-ideriha-Remove-CatCache-Entries.patch


binBLOINvJH0o.bin
Description: v15-0002-ideriha-Memory-consumption-report-reature-of-memorycontext.patch


v15-0001-Add-dlist_move_tail.patch
Description: v15-0001-Add-dlist_move_tail.patch


RE: Protect syscache from bloating with negative cache entries

2019-02-20 Thread Ideriha, Takeshi
>From: Tsunakawa, Takayuki
>>From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
>> number of tables   | 100         |1000        |10000
>> ------------------------------------------------------
>> TPS (master)       |10966        |10654       |9099
>> TPS (patch)        | 11137 (+1%) |10710 (+0%) |772 (-91%)
>>
>> It seems that before the cache exceeds the limit (no pruning at 100 and
>> 1000), the results are almost the same as master, but after exceeding
>> the limit (at 10000) the decline happens.
>
>How many concurrent clients?
One client (default setting). 

>Can you show the perf's call graph sampling profiles of both the unpatched and
>patched version, to confirm that the bottleneck is around catcache eviction 
>and refill?

I checked it with perf record -avg and perf report.
The following shows the top 20 symbols during the benchmark, including
kernel space. The main difference between master (unpatched) and the
patched one seems to be that the patched one consumes CPU in
catcache-evict-and-refill functions: SearchCatCacheMiss(),
CatalogCacheCreateEntry(), and CatCacheCleanupOldEntries().
So it seems to me that these functions need further inspection to suppress
the performance decline as much as possible.

Master(%)  master                 | patch(%)  patch
51.25%  cpu_startup_entry         |   51.45%  cpu_startup_entry
51.13%  arch_cpu_idle             |   51.19%  arch_cpu_idle
51.13%  default_idle              |   51.19%  default_idle
51.13%  native_safe_halt          |   50.95%  native_safe_halt
36.27%  PostmasterMain            |   46.98%  PostmasterMain
36.27%  main                      |   46.98%  main
36.27%  __libc_start_main         |   46.98%  __libc_start_main
36.07%  ServerLoop                |   46.93%  ServerLoop
35.75%  PostgresMain              |   46.89%  PostgresMain
26.03%  exec_simple_query         |   45.99%  exec_simple_query
26.00%  rest_init                 |   43.40%  SearchCatCacheMiss
26.00%  start_kernel              |   42.80%  CatalogCacheCreateEntry
26.00%  x86_64_start_reservations |   42.75%  CatCacheCleanupOldEntries
26.00%  x86_64_start_kernel       |   27.04%  rest_init
25.26%  start_secondary           |   27.04%  start_kernel
10.25%  pg_plan_queries           |   27.04%  x86_64_start_reservations
10.17%  pg_plan_query             |   27.04%  x86_64_start_kernel
10.16%  main                      |   24.42%  start_secondary
10.16%  __libc_start_main         |   22.35%  pg_analyze_and_rewrite
10.03%  standard_planner          |   22.35%  parse_analyze

Regards,
Takeshi Ideriha



RE: Protect syscache from bloating with negative cache entries

2019-02-19 Thread Ideriha, Takeshi
>From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
>But at the same time, I did some benchmarks with only the hard limit
>option enabled and the time-related option disabled, because figures for
>this case have not been provided in this thread.
>So let me share them.

I'm sorry, but I'm taking back the results for the patch and correcting
them. I configured postgresql (master) with only CFLAGS=-O2, but I
misconfigured postgres (patch applied) with
--enable-cassert --enable-debug --enable-tap-tests 'CFLAGS=-O0'.
These debug options (especially --enable-cassert) caused enormous overhead.
(I thought I had checked the configure options... I was maybe tired.)
So I changed them to only 'CFLAGS=-O2' and re-measured.

>I did two experiments. One is to show that negative cache bloat is
>suppressed.
>This thread originated from the issue that negative caches of pg_statistic
>bloat as creating and dropping a temp table is repeatedly executed.
>https://www.postgresql.org/message-id/20161219.201505.11562604.horiguchi.kyotaro%40lab.ntt.co.jp
>Using the script attached to the first email in this thread, I repeated
>create and drop temp table 10,000 times.
>(The experiment is repeated 5 times; catalog_cache_max_size = 500kB.
> Compared master branch and the patch with a hard memory limit.)
>
>Here are TPS and CacheMemoryContext 'used' memory (total - freespace)
>calculated by MemoryContextPrintStats() at 100, 1000, and 10,000
>create-and-drop transactions. The result shows cache bloating is
>suppressed after exceeding the limit (at 10,000), but TPS declines
>regardless of the limit.
>
>number of tx (create and drop)    | 100    |1000    |10000
>----------------------------------------------------------
>used CacheMemoryContext (master)  |610296  |2029256 |15909024
>used CacheMemoryContext (patch)   |755176  |880552  |880592
>----------------------------------------------------------
>TPS (master)                      |414     |407     |399
>TPS (patch)                       |242     |225     |220

Correct one:
number of tx (create and drop)   | 100    |1000    |10000
----------------------------------------------------------
TPS (master)                     |414     |407     |399
TPS (patch)                      |447     |415     |409

The results between master and patch are almost the same.


>The other experiment uses Tomas's script posted a while ago. The scenario
>is to do select 1 from multiple tables randomly (uniform distribution).
>(The experiment is repeated 5 times; catalog_cache_max_size = 10MB.
> Compared master branch and the patch with only the hard memory limit
> enabled.)
>
>Before the benchmark, I checked that pruning happens only at 10,000
>tables, using a debug option. The result shows degradation regardless of
>before or after pruning. I personally still need the hard size limitation,
>but I'm surprised that the difference is so significant.
>
>number of tables   | 100   |1000  |10000
>-----------------------------------------
>TPS (master)       |10966  |10654 |9099
>TPS (patch)        |4491   |2099  |378

Correct one:
number of tables   | 100         |1000        |10000
------------------------------------------------------
TPS (master)       |10966        |10654       |9099
TPS (patch)        | 11137 (+1%) |10710 (+0%) |772 (-91%)

It seems that before the cache exceeds the limit (no pruning at 100 and
1000), the results are almost the same as master, but after exceeding the
limit (at 10,000) the decline happens.


Regards,
Takeshi Ideriha



RE: Protect syscache from bloating with negative cache entries

2019-02-15 Thread Ideriha, Takeshi
>From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]

>>About the new global-size based eviction(2), cache entry creation
>>becomes slow after the total size reached the limit since every one
>>new entry evicts one or more old (=
>>not-recently-used) entries. Because of not needing knobs for each
>>cache, it becomes far more realistic. So I added documentation of
>"catalog_cache_max_size" in 0005.
>
>Now I'm also trying to benchmark, which will be posted in another email.

According to recent comments by Andres and Bruce,
maybe we should address negative cache bloat step by step,
for example by reviewing Tom's patch.

But at the same time, I did some benchmark with only the hard limit option
enabled and the time-related option disabled, because the figures for this
case are not provided in this thread.
So let me share them.

I did two experiments. One is to show that negative cache bloat is suppressed.
This thread originated from the issue that the negative cache of pg_statistic
is bloating as creating and dropping a temp table is repeatedly executed.
https://www.postgresql.org/message-id/20161219.201505.11562604.horiguchi.kyotaro%40lab.ntt.co.jp

Using the script attached to the first email in this thread, I repeated create
and drop temp table 10000 times.
(experiment is repeated 5 times. catalog_cache_max_size = 500kB. 
 compared master branch and patch with hard memory limit)

Here are TPS and CacheMemoryContext 'used' memory (total - freespace)
calculated by MemoryContextPrintStats()
at 100, 1000, and 10000 create-and-drop transactions. The result shows that
cache bloating is suppressed
after exceeding the limit (at 10000) but TPS declines regardless of the limit.

number of tx (create and drop)    | 100  |1000    |10000
---------------------------------------------------------
used CacheMemoryContext  (master) |610296|2029256 |15909024
used CacheMemoryContext  (patch)  |755176|880552  |880592
---------------------------------------------------------
TPS (master)                      |414   |407     |399
TPS (patch)                       |242   |225     |220


Another experiment uses Tomas's script posted a while ago.
The scenario is to do select 1 from multiple tables chosen randomly (uniform
distribution).
(experiment is repeated 5 times. catalog_cache_max_size = 10MB.
 compared master branch and patch with only hard memory limit enabled)

Before doing the benchmark, I checked that pruning happens only at 10000 tables
using a debug option. The result shows degradation regardless of before or
after pruning.
I personally still need hard size limitation, but I'm surprised that the
difference is so significant.

number of tables   | 100  |1000 |10000
---------------------------------------
TPS (master)       |10966 |10654|9099
TPS (patch)        |4491  |2099 |378

Regards,
Takeshi Ideriha




RE: Protect syscache from bloating with negative cache entries

2019-02-14 Thread Ideriha, Takeshi
>From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
>
>
>(2) Another new patch v15-0005 on top of previous design of
>  limit-by-number-of-a-cache feature converts it to
>  limit-by-size-on-all-caches feature, which I think is
>  Tsunakawa-san wanted.
Yeah, size looks better to me.

>As far as I can see no significant degradation is found in usual (as long as 
>pruning
>doesn't happen) code paths.
>
>About the new global-size based eviction(2), cache entry creation becomes
>slow after
>the total size reached the limit since every one new entry evicts one or
>more old (=
>not-recently-used) entries. Because of not needing knobs for each cache, it
>becomes
>far more realistic. So I added documentation of "catalog_cache_max_size" in 0005.

Now I'm also trying to benchmark, which will be posted in another email.

Here are things I noticed:

[1] compiler warning
catcache.c:109:1: warning: missing braces around initializer [-Wmissing-braces]
 dlist_head cc_lru_list = {0};
 ^
catcache.c:109:1: warning: (near initialization for ‘cc_lru_list.head’) 
[-Wmissing-braces]

[2] catalog_cache_max_size does not appear in postgresql.conf.sample

[3] the global LRU list and global size can be included in CatCacheHeader,
which seems to me
a good place, because this structure contains global cache information
regardless of the kind of CatCache

[4] when applying the patch with git am, there are several warnings about
trailing whitespace in v15-0003

Regards,
Takeshi Ideriha



RE: performance issue in remove_from_unowned_list()

2019-02-08 Thread Ideriha, Takeshi
>From: Tomas Vondra [mailto:tomas.von...@2ndquadrant.com]
>But it's a bit funnier, because there's also DropRelationFiles() which does 
>smgrclose on
>a batch of relations too, and it says this
>
>/*
> * Call smgrclose() in reverse order as when smgropen() is called.
> * This trick enables remove_from_unowned_list() in smgrclose()
> * to search the SMgrRelation from the unowned list,
> * with O(1) performance.
> */
>for (i = ndelrels - 1; i >= 0; i--)
>...
>
>but it's called from two places in xact.c only. And if you trigger the issue 
>with 100k x
>CREATE TABLE + ROLLBACK, and then force a recovery by killing postmaster, you
>actually get the very same behavior with always traversing the unowned list 
>for some
>reason. I'm not quite sure why, but it seems the optimization is exactly the 
>wrong thing
>to do here.

So when DropRelationFiles() is called, is the order of calling smgrclose()
always the same as the order of the unowned list?

This one was introduced in b4166911, and I'd like to hear the author's and
reviewer's opinions.
https://www.postgresql.org/message-id/CAHGQGwH0hwXwrCDnmUU2Twj5YgHcrmMCVD7O%3D1NrRTpHcbtCBw%40mail.gmail.com

Regards,
Takeshi Ideriha


RE: Protect syscache from bloating with negative cache entries

2019-02-08 Thread Ideriha, Takeshi
>From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
>I made a rerun of benchmark using "-S -T 30" on the server build with no 
>assertion and
>-O2. The numbers are the best of three successive attempts.  The patched 
>version is
>running with cache_target_memory = 0, cache_prune_min_age = 600 and
>cache_entry_limit = 0 but pruning doesn't happen by the workload.
>
>master: 13393 tps
>v12   : 12625 tps (-6%)
>
>Significant degradation is found.
>
>Reduced frequency of dlist_move_tail by taking a 1ms interval between two
>successive
>updates on the same entry let the degradation disappear.
>
>patched  : 13720 tps (+2%)

It would be good to introduce some interval.
I followed your benchmark (initialized with scale factor=10; other options are
the same)
and found the same tendency.
These are averages of 5 trials:
master:   7640.000538
patch_v12:7417.981378 (3% down against master)
patch_v13:7645.071787 (almost the same as master)

These cases are workloads where pruning doesn't happen, as you mentioned.
I'd like to benchmark the cache-pruning case as well.
To demonstrate the cache-pruning case,
right now I'm making hundreds of partitioned tables and running a select query
against each partition
using a pgbench custom file (see the sketch below). Maybe using a small
cache_prune_min_age or the hard limit would be better.
Is there any good model?

># I'm not sure the name LRU_IGNORANCE_INTERVAL makes sense..
How about MIN_LRU_UPDATE_INTERVAL?

Regards,
Takeshi Ideriha




RE: Protect syscache from bloating with negative cache entries

2019-02-07 Thread Ideriha, Takeshi
Hi, thanks for recent rapid work. 

>From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
>At Tue, 5 Feb 2019 19:05:26 -0300, Alvaro Herrera 
>wrote in <20190205220526.GA1442@alvherre.pgsql>
>> On 2019-Feb-05, Tomas Vondra wrote:
>>
>> > I don't think we need to remove the expired entries right away, if
>> > there are only very few of them. The cleanup requires walking the
>> > hash table, which means significant fixed cost. So if there are only
>> > few expired entries (say, less than 25% of the cache), we can just
>> > leave them around and clean them if we happen to stumble on them
>> > (although that may not be possible with dynahash, which has no
>> > concept of expiration) of before enlarging the hash table.
>>
>> I think seqscanning the hash table is going to be too slow;
>> Ideriha-san idea of having a dlist with the entries in LRU order
>> (where each entry is moved to head of list when it is touched) seemed
>> good: it allows you to evict older ones when the time comes, without
>> having to scan the rest of the entries.  Having a dlist means two more
>> pointers on each cache entry AFAIR, so it's not a huge amount of memory.
>
>Ah, I had a separate list in my mind. Sounds reasonable to have pointers in 
>cache entry.
>But I'm not sure how much additional
>dlist_* impact.

Thank you for picking up my comment, Alvaro.
That's what I was thinking about.

>The attached is the new version with the following properties:
>
>- Both prune-by-age and hard limiting feature.
>  (Merged into single function, single scan)
>  Debug tracking feature in CatCacheCleanupOldEntries is removed
>  since it no longer runs a full scan.
It seems to me that adding a hard-limit strategy besides the prune-by-age one
is good
for the variety of (contradictory) cases which have been discussed in this
thread. I need the hard limit as well.

The hard limit is currently represented as a number of cache entries,
controlled by both cache_entry_limit and cache_entry_limit_prune_ratio.
Why don't we change it to an amount of memory (bytes)?
An amount of memory is a more direct parameter for a customer who wants to
set the hard limit, and it is easier to tune than a number of cache entries
(as sketched below).
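For example, in postgresql.conf (a sketch; the value is arbitrary):

catalog_cache_max_size = 10MB    # amount of memory, rather than an entry count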

>- Using LRU to get rid of full scan.
>
>I added new API dlist_move_to_tail which was needed to construct LRU.

I just thought that there is already dlist_move_head(), which could be used if
new entries are kept at
the head side and old ones at the tail side. But that's no objection to adding
the new API, because depending on the situation the head for new entries could
make for more readable code,
and vice versa.

Regards,
Takeshi Ideriha




RE: Cache relation sizes?

2019-02-05 Thread Ideriha, Takeshi
>From: Jamison, Kirk [mailto:k.jami...@jp.fujitsu.com]

>On the other hand, the simplest method I thought that could also work is to 
>only cache
>the file size (nblock) in shared memory, not in the backend process, since 
>both nblock
>and relsize_change_counter are uint32 data type anyway. If 
>relsize_change_counter
>can be changed without lock, then nblock can be changed without lock, is it 
>right? In
>that case, nblock can be accessed directly in shared memory. In this case, is 
>the
>relation size necessary to be cached in backend?

(Aside from which idea is better..)
If you want to put the relation size in shared memory, then I don't think a
cache in each backend
is necessary, because every time relation_size is updated you would need to
invalidate the caches
in the backends. Taking a shared lock on the cache for reference and an
exclusive lock for update
is simple without a backend cache (see the sketch below).
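A sketch of what I mean, assuming a single LWLock protecting the shared value
(RelSizeCacheLock and mdshared are made-up names):

/* reader: a shared lock is enough to read the cached size */
LWLockAcquire(RelSizeCacheLock, LW_SHARED);
nblocks = mdshared->nblocks;
LWLockRelease(RelSizeCacheLock);

/* writer (extension/truncation): an exclusive lock to update it */
LWLockAcquire(RelSizeCacheLock, LW_EXCLUSIVE);
mdshared->nblocks = new_nblocks;
LWLockRelease(RelSizeCacheLock);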

>(2) Is the MdSharedData temporary or permanent in shared memory?
That data structure seems to be initialized at the time of InitPostgres, which
means it's permanent,
because postgres-initialized shared memory doesn't have a chance to drop it as
far as I know.
(If you want a temporary data structure, then another mechanism like
dsm/dsa/dshash is a candidate.)

Regards,
Takeshi Ideriha


RE: Protect syscache from bloating with negative cache entries

2019-02-04 Thread Ideriha, Takeshi
>From: br...@momjian.us [mailto:br...@momjian.us]
>On Mon, Feb 4, 2019 at 08:23:39AM +, Tsunakawa, Takayuki wrote:
>> Horiguchi-san, Bruce, all, So, why don't we make
>> syscache_memory_target the upper limit on the total size of all
>> catcaches, and rethink the past LRU management?
>
>I was going to say that our experience with LRU has been that the overhead is 
>not
>worth the value, but that was in shared resource cases, which this is not.

One idea is building a list with an access counter for implementing LRU, based
on the current patch.
The list is ordered by last access time. When a catcache entry is referenced,
the list is maintained,
which is just manipulation of pointers a few times (a rough sketch follows).
As Bruce mentioned, it's not shared, so there is no cost related to lock
contention.

When it comes to pruning, entries older than a certain timestamp with a zero
access counter are pruned.
This way would improve performance because it only scans a limited range
(bounded by sys_cache_min_age).
The current patch scans all hash entries and checks each timestamp, which
would decrease performance as the cache size grows.
I'm hoping to implement this idea and measure the performance.

And when we want to set the memory size limit, as Tsunakawa-san said, the LRU
list would be suitable.
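A rough sketch of the reference-side maintenance I have in mind (the lru_node
field and the function name are assumptions, not actual patch code):

static inline void
CatCacheTouchEntry(CatCTup *ct)
{
    ct->naccess++;                      /* per-entry access counter */
    ct->lastaccess = catcacheclock;     /* approximated current time */
    /* O(1): unlink and relink at the head, most recently used first */
    dlist_move_head(&cc_lru_list, &ct->lru_node);
}

Pruning then walks from the tail and can stop at the first entry younger than
the threshold.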

Regards,
Takeshi Ideriha




RE: Protect syscache from bloating with negative cache entries

2019-01-29 Thread Ideriha, Takeshi
Hi

>> > I suggest you go with just syscache_prune_min_age, get that into PG
>> > 12, and we can then reevaluate what we need.  If you want to
>> > hard-code a minimum cache size where no pruning will happen, maybe
>> > based on the system catalogs or typical load, that is fine.
>>
>> Please forgive me if I say something silly (I might have got lost.)
>>
>> Are you suggesting to make the cache size limit system-defined and 
>> uncontrollable
>by the user?  I think it's necessary for the DBA to be able to control the 
>cache memory
>amount.  Otherwise, if many concurrent connections access many partitions 
>within a
>not-so-long duration, then the cache eviction can't catch up and ends up in 
>OOM.
>How about the following questions I asked in my previous mail?
>
>cache_memory_target does the opposite of limiting memory usage. It keeps some
>amount of syscache entries unpruned. It is intended for sessions where
>cache-effective queries run intermittently.
>syscache_prune_min_age also doesn't directly limit the size. It just eventually
>prevents infinite memory consumption.
>
>The knobs are not no-brainer at all and don't need tuning in most cases.
>
>> --
>> This is a pure question.  How can we answer these questions from users?
>>
>> * What value can I set to cache_memory_target when I can use 10 GB for the
>caches and max_connections = 100?
>> * How much RAM do I need to have for the caches when I set 
>> cache_memory_target
>= 1M?
>>
>> The user tends to estimate memory to avoid OOM.
>> --
>
>You don't have direct control of syscache memory usage. When you find a query
>slowed by the default cache expiration, you can set cache_memory_target to keep
>them for intermittent execution of a query, or you can increase
>syscache_prune_min_age to allow caches to live for a longer time.
>


In the current v8 patch there is a stats view representing the age class
distribution.
https://www.postgresql.org/message-id/20181019.173457.68080786.horiguchi.kyotaro%40lab.ntt.co.jp
Does it help the DBA with tuning cache_prune_age and/or cache_prune_target?
If the number of cache entries in the older age classes is large, are people
supposed to lower prune_age
and leave cache_prune_target unchanged?
(I'm a little confused.)

Regards,
Takeshi Ideriha




RE: Localization tools for Postgres

2018-12-20 Thread Ideriha, Takeshi
Hi
cc’ed pgsql-translators

I personally use 'poedit' when working on po files.


Takeshi Ideriha
Fujitsu Limited

From: Дмитрий Воронин [mailto:carriingfat...@yandex.ru]
Sent: Thursday, December 20, 2018 12:54 AM
To: pgsql-hack...@postgresql.org
Subject: Localization tools for Postgres

Hi, hackers.

Can you tell me which tools do you prefer when you update *.po translations?

Thanks!


RE: Protect syscache from bloating with negative cache entries

2018-12-19 Thread Ideriha, Takeshi
>From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
>The detailed comments for the source code will be provided later.

Hi, 
I'm adding some comments to 0001 and 0002 one.

[0001 patch]

+   /*
+    * Calculate the duration from the time of the last access to the
+    * "current" time. Since catcacheclock is not advanced within a
+    * transaction, the entries that are accessed within the current
+    * transaction won't be pruned.
+    */
+   TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);

+   /*
+    * Try to remove entries older than cache_prune_min_age seconds.
+    */
+   if (entry_age > cache_prune_min_age)
Can you change this comparison between entry_age and cache_prune_min_age
to "entry_age >= cache_prune_min_age" (see the one-line sketch below)?
That is, I want entries that are accessed even within the transaction
to be pruned in the case of cache_prune_min_age = 0.
I can think of some of my customers who want to always keep memory usage below
a certain limit as strictly as possible.
This kind of strict user would set cache_prune_min_age to 0 and would not want
to exceed the memory target even
within a transaction.
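That is, the change is just one character (sketch):

-   if (entry_age > cache_prune_min_age)
+   if (entry_age >= cache_prune_min_age)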

As I put miscellaneous comments about the 0001 patch in a previous email,
please take a look at them.

[0002 patch]
I haven't looked into every detail but here are some comments.

Maybe you would also need to add some sentences to this page:
https://www.postgresql.org/docs/current/monitoring-stats.html

+pgstat_get_syscache_stats(PG_FUNCTION_ARGS) 
Function name like 'pg_stat_XXX' would match surrounding code. 

When applying patch I found trailing whitespace warning:
../patch/horiguchi_cache/v6_stats/0002-Syscache-usage-tracking-feature.patch:157:
 trailing whitespace.

../patch/horiguchi_cache/v6_stats/0002-Syscache-usage-tracking-feature.patch:256:
 trailing whitespace.

../patch/horiguchi_cache/v6_stats/0002-Syscache-usage-tracking-feature.patch:301:
 trailing whitespace.

../patch/horiguchi_cache/v6_stats/0002-Syscache-usage-tracking-feature.patch:483:
 trailing whitespace.

../patch/horiguchi_cache/v6_stats/0002-Syscache-usage-tracking-feature.patch:539:
 trailing whitespace.


Regards,
Takeshi Ideriha




RE: Protect syscache from bloating with negative cache entries

2018-12-19 Thread Ideriha, Takeshi
Hello,

Sorry for delay. 
The detailed comments for the source code will be provided later.

>> I just thought that the pair of ageclass and nentries can be
>> represented as json or multi-dimensional array but in virtual they are
>> all same and can be converted each other using some functions. So I'm not 
>> sure
>which representaion is better one.
>
>Multi dimentional array in any style sounds reasonable. Maybe array is 
>preferable in
>system views as it is a basic type than JSON. In the attached, it looks like 
>the follows:
>
>=# select * from pg_stat_syscache  where ntuples > 100; -[ RECORD
>1 ]--
>pid | 1817
>relname | pg_class
>cache_name  | pg_class_oid_index
>size| 2048
>ntuples | 189
>searches| 1620
>hits| 1431
>neg_hits| 0
>ageclass| {{30,189},{60,0},{600,0},{1200,0},{1800,0},{0,0}}
>last_update | 2018-11-27 19:22:00.74026+09

Thanks, cool. That seems better to me.

>
>> >3. non-transactional GUC setting (in 0003)
>> >
>> >It allows setting GUC variable set by the action
>> >GUC_ACTION_NONXACT(the name requires condieration) survive beyond
>> >rollback. It is required by remote guc setting to work sanely.
>> >Without the feature a remote-set value within a trasction will
>> >disappear involved in rollback. The only local interface for the
>> >NONXACT action is set_config(name, value, is_local=false, is_nonxact = 
>> >true).
>pg_set_backend_guc() below works on this feature.
>>
>> TBH, I'm not familiar with around this and I may be missing something.
>> In order to change the other backend's GUC value, is ignoring
>> transactional behevior always necessary? When transaction of GUC
>> setting is failed and rollbacked, if the error message is supposeed to
>> be reported I thought just trying the transaction again is enough.
>
>The target backend can be running frequent transactions.  The invoker backend
>cannot
>know whether the remote change happened during a transaction and whether the
>transaction, if any, is committed or aborted; no error message is sent to the
>invoker backend.
>We could wait for the end of a transaction but that doesn't work with long
>transactions.
>
>Maybe we don't need the feature in GUC system but adding another similar 
>feature
>doesn't seem reasonable. This would be useful for some other tracking features.

Thank you for the clarification. 


>> >4. pg_set_backend_guc() function.
>> >
>> >Of course syscache statistics recording consumes significant amount
>> >of time so it cannot be turned on usually. On the other hand since
>> >this feature is turned on by GUC, it is needed to grab the active
>> >client connection to turn on/off the feature(but we cannot). Instead, I 
>> >provided a
>means to change GUC variables in another backend.
>> >
>> >pg_set_backend_guc(pid, name, value) sets the GUC variable "name"
>> >on the backend "pid" to "value".
>> >
>> >
>> >
>> >With the above tools, we can inspect catcache statistics of seemingly 
>> >bloated
>process.
>> >
>> >A. Find a bloated process pid using ps or something.
>> >
>> >B. Turn on syscache stats on the process.
>> >  =# select pg_set_backend_guc(9984, 'track_syscache_usage_interval',
>> >'1');
>> >
>> >C. Examine the statitics.
>> >
>> >=# select pid, relname, cache_name, size from pg_stat_syscache order
>> >by size desc limit 3;
>> > pid  |   relname|cache_name|   size
>> >--+--+--+--
>> > 9984 | pg_statistic | pg_statistic_relid_att_inh_index | 32154112
>> > 9984 | pg_cast  | pg_cast_source_target_index  | 4096
>> > 9984 | pg_operator  | pg_operator_oprname_l_r_n_index  | 4096
>> >
>> >
>> >=# select * from pg_stat_syscache where cache_name =
>> >'pg_statistic_relid_att_inh_index'::regclass;
>> >-[ RECORD 1 ]-
>> >pid | 9984
>> >relname | pg_statistic
>> >cache_name  | pg_statistic_relid_att_inh_index
>> >size| 11026176
>> >ntuples | 77950
>> >searches| 77950
>> >hits| 0
>> >neg_hits| 0
>> >ageclass| {30,60,600,1200,1800,0}
>> >nentries| {17630,16950,43370,0,0,0}
>> >last_update | 2018-10-17 15:58:19.738164+09
>>
>> The output of this view seems good to me.
>>
>> I can imagine this use case. Does the use case of setting GUC locally never 
>> happen?
>> I mean can the setting be locally changed?
>
>Syscache grows through the life of a backend/session. No other client can
>connect to
>it at the same time. So the variable must be set at the start of a backend 
>using ALTER
>USER/DATABASE, or the client itself is obliged to deliberitely turn on the 
>feature at a
>convenient time. I suppose that in most use cases one wants to turn on this 
>feature
>after he sees another session is eating memory more and more.
>
>The attached is the rebased version that has multidimentional ageclass.

Thank you! That's convenient.
How about splitting this non-xact guc 

RE: Copy data to DSA area

2018-12-04 Thread Ideriha, Takeshi
Hi, thank you for the comment.

>-Original Message-
>From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
>Sent: Wednesday, November 28, 2018 5:09 PM
>
>Hello.
>
>At Wed, 28 Nov 2018 05:13:26 +0000, "Ideriha, Takeshi"
> wrote in
><4E72940DA2BF16479384A86D54D0988A6F3BD73A@G01JPEXMBKW04>
>> Hi
>>
>> >From: Thomas Munro [mailto:thomas.mu...@enterprisedb.com]
>> >Here we need something better, because once this new kind of memory
>> >fills up with leaked objects, you have to restart the whole cluster to 
>> >recover.
>>
>> Thank you for the clarification. I agree that leaving memory leaking
>> in shared memory without freeing it ends up cluster-wide problem.
>
>If some of the "leaked" objects are liked each other with other "live" objects 
>including
>those on local memory, we must tell whether each "maybe-leaked" object is 
>actually
>a leak or not. That looks apprently stupid. Maybe we shouldn't manipulate 
>objects on
>shared memory at such a low-level interface. We would need a kind of resource
>manager for such structures holding shared objects and should (perhaps 
>explicitly)
>register the objects that shuold be free'd on error-exiting from a scope. This 
>seems a
>bit different from reset callbacks of the current MemoryContext.

Sure. We want to get rid of dangling pointers in error handling, but this
should be
transparent.


>> >If the solution involves significantly different control flows (like
>> >PG_TRY(),
>> >PG_FINALLY() manual cleanup all over the place), then lots of
>> >existing palloc-based code will need to be reviewed and rewritten
>> >very carefully for use in this new kind of memory context.  If it can
>> >be done automagically, then more of our existing code will be able to
>> >participate in the construction of stuff that we want to put in shared 
>> >memory.
>>
>> About resetting X->a I was thinking about how to treat dangling
>> pointer inside a new memory context machinery instead of using
>> PG_TRY() stuff. That's because putting PG_TRY() all over the place
>> becomes responsible for a developer trying to shared-memory-version
>> MemoryContext and it needs a lot of efforts as you noted. But in my mind 
>> even if
>PG_TRY() is used, it only affects new code uses shared-memory version
>MemoryContext and doesn't affect current code.
>> I may be missing something here...
>
>Thomas mentioned exiting upper-layer library code using palloc, like pg_list.  
>As I
>wrote above, pg_list is one of such feature that would need to be carefully 
>adjusted.

Thank you for the clarification. I think I was misunderstanding. In a previous
email
I suggested "points_to_pallocd_data(pointer, new_data)" to remember the old
value and
reset the pointer at error-handling time. Yeah, this interface needs to
be implemented
in existing functions that wrap palloc and set the pointer.

>> >I first thought about this in the context of shared plan caches, a
>> >topic that appears on this mailing list from time to time.  There are
>> >some other fun problems, like cache invalidation for shared caches,
>> >space recycling (LRUs etc), and of course locking/concurrency
>> >(dsa_allocate() and dsa_free() are concurrency safe, but whatever
>> >data structures you build with the memory of course need their own plan for
>dealing with concurrency).
>> >But a strategy for clean-up on errors seems to be a major subproblem
>> >to deal with early in the project.  I will be very happy if we can
>> >come up with something :-)
>>
>> Yeah, I hope we can do it somehow.
>
>In addition to recycling issue within a fixed area, I suppose that we will 
>need to
>"expand" the fixed-address shared memory. PROT_NONE could work for that but it
>must break some platforms..
If we are only talking about DSA created in place in the Postgres-initialized
shared area,
I think we can prevent the memory from expanding using dsa_set_size_limit().

>> Thank you for kind remark.
>> By the way I just thought meta variable "hoge" is used only in Japan
>> :)
>
>I've seen it many-many times including mine in this list, thus why don't you 
>know it is
>already a world-wide jargon? :p

Ah I didn't know that, thanks!

Regards,
Takeshi Ideriha




RE: idle-in-transaction timeout error does not give a hint

2018-12-03 Thread Ideriha, Takeshi
>From: Tatsuo Ishii [mailto:is...@sraoss.co.jp]
>Subject: Re: idle-in-transaction timeout error does not give a hint
>
>>Alternative HINT message would be something like:
>>
>>HINT: In a moment you should be able to reconnect to the
>>  database and restart your transaction.
>>
>>This could make the meaning of the error (transaction aborted)
>>cleaner and might give a better suggestion to the user.
>
> Agreed. Changing "command" to "transaction" seems more accurate.
> People might think only the command they hit is not sent but
> transaction is still alive though it's of course unnatural that
> transaction is alive after
>>>connection is terminated.
>>
>>
>>>Please find attached patch which addresses the point above.
>> Thank you for the update. It looks good to me on this point.
>> Are you planning to change other similar messages?
>
>No, because this is the only message related to an explicit transaction.  Other
>messages are not related to explicit transactions, so current messages look ok 
>to me.

Sure. I didn't have a strong opinion about it, so it's ok.

Regards,
Takeshi Ideriha




RE: idle-in-transaction timeout error does not give a hint

2018-12-02 Thread Ideriha, Takeshi
>From: Tatsuo Ishii [mailto:is...@sraoss.co.jp]
>
>Hi Ideriha-san,
>
> Hi, it makes sense to me. One can submit transaction again same as
> other cases you mentioned.
>
> I didn't attach the patch but according to my simple experiment in
> psql the output would become the following:
>
> FATAL:  terminating connection due to idle-in-transaction timeout
> HINT: In a moment you should be able to reconnect to the
>   database and repeat your command.

Alternative HINT message would be something like:

HINT: In a moment you should be able to reconnect to the
  database and restart your transaction.

This could make the meaning of the error (transaction aborted)
cleaner and might give a better suggestion to the user.
>>>
>>> Agreed. Changing "command" to "transaction" seems more accurate.
>>> People might think only the command they hit is not sent but
>>> transaction is still alive though it's of course unnatural that transaction 
>>> is alive after
>connection is terminated.


>Please find attached patch which addresses the point above.
Thank you for the update. It looks good to me on this point. 
Are you planning to change other similar messages?

>BTW, would you like to be added to the CF item as a reviewer?
Yeah, I've already added myself as a reviewer.

Regards,
Takeshi Ideriha




minor typo in dsa.c

2018-11-28 Thread Ideriha, Takeshi
Hi.
I found a minor typo in dsa.c.
s/set_dsa_size_limit/dsa_set_size_limit/

regards,
==
Takeshi Ideriha
Fujitsu Limited



0001-Minor-typo-in-dsa.c.patch
Description: 0001-Minor-typo-in-dsa.c.patch


RE: idle-in-transaction timeout error does not give a hint

2018-11-28 Thread Ideriha, Takeshi
>> Hi, it makes sense to me. One can submit transaction again same as
>> other cases you mentioned.
>>
>> I didn't attach the patch but according to my simple experiment in
>> psql the output would become the following:
>>
>> FATAL:  terminating connection due to idle-in-transaction timeout
>> HINT: In a moment you should be able to reconnect to the
>>   database and repeat your command.
>
>Alternative HINT message would be something like:
>
>HINT: In a moment you should be able to reconnect to the
>  database and restart your transaction.
>
>This could make the meaning of the error (transaction aborted) cleaner and 
>might give
>a better suggestion to the user.

Agreed. Changing "command" to "transaction" seems more accurate. People might
think that only the command they entered was not executed but that the
transaction is still alive, though it's of course unnatural for a transaction
to be alive after the connection is terminated.

In that case you could also change the messages issued by the other errors
mentioned, while you're at it.

Regards,
Takeshi Ideriha




RE: idle-in-transaction timeout error does not give a hint

2018-11-27 Thread Ideriha, Takeshi
>From: Tatsuo Ishii [mailto:is...@sraoss.co.jp]
>Sent: Wednesday, November 28, 2018 12:18 PM
>To: pgsql-hackers@lists.postgresql.org
>Subject: idle-in-transaction timeout error does not give a hint
>
>idle-in-transaction timeout error closed the session. I think in this case the 
>error
>message should give a hint something like other errors (for example
>ERRCODE_CRASH_SHUTDOWN or
>ERRCODE_T_R_SERIALIZATION_FAILURE) to ask users to reconnect.
>Attached patch does that.

Hi, it makes sense to me. One can submit the transaction again, the same as in
the other cases you mentioned.

I didn't attach the patch but according to my simple experiment
in psql the output would become the following:

FATAL:  terminating connection due to idle-in-transaction timeout
HINT: In a moment you should be able to reconnect to the
  database and repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

Regards,
Takeshi Ideriha




RE: Copy data to DSA area

2018-11-27 Thread Ideriha, Takeshi
Hi

>From: Thomas Munro [mailto:thomas.mu...@enterprisedb.com]
>Sent: Wednesday, November 14, 2018 9:50 AM
>To: Ideriha, Takeshi/出利葉 健 
>
>On Tue, Nov 13, 2018 at 10:59 PM Ideriha, Takeshi 
>
>wrote:
>> Can I check my understanding?
>> The situation you are talking about is the following:
>> Data structure A and B will be allocated on DSA. Data A has a pointer to B.
>> Another data X is already allocated on some area (might be on local
>> heap) and has a pointer variable X->a, which will be linked to A.
>>
>>  old_context = MemoryContextSwitchTo(dsa_memory_context);
>>  A = palloc();
>>  B = palloc();
>>  A->b = B;
>>  X->a = A;
>>  MemoryContextSwitchTo(old_context);
>>
>> If error happens in the way of this flow, palloc'd data (A & B) should
>> be freed and pointer to freed data (X->a) should be back to its original one.
>> So handling these cases introduces an "transaction" API like 
>> begin_allocate() and
>end_allocate().
>
>I wasn't thinking of resetting X->a to its original value, just freeing A and 
>B.  For a
>shared cache in this memory area, you need to make sure that you never leak
>memory; it must either be referenced by the cache book keeping data structures 
>so
>that you can find it and manually free it eventually (probably with an LRU 
>system?), or
>it must somehow be cleaned up if an error occurs before you manage to insert 
>it into
>the cache.  During construction of an object graph, before you attach it to 
>your
>manual book keeping data structures, there is a danger zone.
>
>For example, what if, while copying a query plan or a catcache entry, we 
>successfully
>allocate a new cache entry with palloc(), but then another palloc() call fails 
>because
>we run out of memory when copying the keys?  It looks like the current 
>catcache.c
>coding just doesn't
>care: memory will be leaked permanently in CacheMemoryContext.  That's OK
>because it is process-local.  Such a process is probably doomed anyway.  If 
>you exit
>and then start a new backend the problem is gone.
>Here we need something better, because once this new kind of memory fills up 
>with
>leaked objects, you have to restart the whole cluster to recover.

Thank you for the clarification. I agree that leaving memory leaking in shared
memory
without freeing it ends up being a cluster-wide problem.

>If the solution involves significantly different control flows (like PG_TRY(),
>PG_FINALLY() manual cleanup all over the place), then lots of existing 
>palloc-based
>code will need to be reviewed and rewritten very carefully for use in this new 
>kind of
>memory context.  If it can be done automagically, then more of our existing 
>code will
>be able to participate in the construction of stuff that we want to put in 
>shared
>memory.

About resetting X->a, I was thinking about how to treat dangling pointers
inside a new memory
context machinery instead of using PG_TRY() stuff. That's because putting
PG_TRY() all over the
place becomes the responsibility of a developer trying to use the
shared-memory version of MemoryContext, and
it needs a lot of effort, as you noted. But in my mind, even if PG_TRY() is
used, it only affects new
code that uses the shared-memory version of MemoryContext and doesn't affect
current code.
I may be missing something here...

>I first thought about this in the context of shared plan caches, a topic that 
>appears on
>this mailing list from time to time.  There are some other fun problems, like 
>cache
>invalidation for shared caches, space recycling (LRUs etc), and of course
>locking/concurrency
>(dsa_allocate() and dsa_free() are concurrency safe, but whatever data 
>structures
>you build with the memory of course need their own plan for dealing with 
>concurrency).
>But a strategy for clean-up on errors seems to be a major subproblem to deal 
>with
>early in the project.  I will be very happy if we can come up with something 
>:-)

Yeah, I hope we can do it somehow.


>On Wed, Nov 14, 2018 at 12:40 AM Ideriha, Takeshi 
>
>wrote:
>> From: Thomas Munro [mailto:thomas.mu...@enterprisedb.com]
>> >postgres=# call hoge_list(3);
>> >NOTICE:  Contents of list:
>> >NOTICE:   1
>> >NOTICE:   2
>> >NOTICE:   3
>> >CALL
>> >
>> >You can call that procedure from various different backends to add
>> >and remove integers from a List.  The point is that it could just as
>> >easily be a query plan or any other palloc-based stuff in our tree, and the 
>> >pointers
>work in any backend.
>>
>> Thanks for the patch. I know it's a rough sketch but basically that's what I 
>> want to
>do.
>> I tried it and it works well.
>
>After your message I couldn't resist a small experiment.  But it seems there 
>are
>plenty more problems to be solved to make it useful...

Thank you for the kind remark.
By the way, I just thought the metasyntactic variable "hoge" was used only in
Japan :)

Regards,
Takeshi Ideriha


RE: Protect syscache from bloating with negative cache entries

2018-11-27 Thread Ideriha, Takeshi
>From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
>I haven't looked into the code but I'm going to do it later.

Hi, I've taken a look at 0001 patch. Reviewing the rest of patch will be later.

    if (!IsParallelWorker())
+   {
        stmtStartTimestamp = GetCurrentTimestamp();
+
+       /* Set this timestamp as aproximated current time */
+       SetCatCacheClock(stmtStartTimestamp);
+   }
    else
  
Just a confirmation.
At first I thought that when a parallel worker is active, catcacheclock is not
updated.
But when a parallel worker is active, catcacheclock is updated by the parent,
so no problem occurs.

+   int         tupsize = 0;

    /* negative entries have no tuple associated */
    if (ntp)
    {
        int         i;
+       int         tupsize;
...
+       ct->size = tupsize;

@@ -1906,17 +2051,24 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
    ct->dead = false;
    ct->negative = negative;
    ct->hash_value = hashValue;
+   ct->naccess = 0;
+   ct->lastaccess = catcacheclock;
+   ct->size = tupsize;

tupsize is declared twice, inside and outside the if scope, but it doesn't
seem you need to do so.
And ct->size = tupsize is executed twice, in the if block and outside the
if-else block.

+static inline TimestampTz
+GetCatCacheClock(void)

This function is not called by anyone in this version of the patch. In the
previous version it was called by plancache.
Will further patches focus only on catcache? In that case this one can be
removed.

There are some typos.
+   int size;   /* palloc'ed size off 
this tuple */ 
typo: off->of

+   /* Set this timestamp as aproximated current time */
typo: aproximated->approximated

+ * GUC variable to define the minimum size of hash to cosider entry eviction.
typo: cosider -> consider

+   /* initilize catcache reference clock if haven't done yet */ 
typo:initilize -> initialize

Regards,
Takeshi Ideriha


RE: Global shared meta cache

2018-11-26 Thread Ideriha, Takeshi
Hi,

>From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
>Sent: Wednesday, October 3, 2018 3:18 PM
>At this moment this patch only allocates catalog cache header and CatCache 
>data on
>the shared memory area.
I'm trying to handle this allocation stuff in another thread [1] in a broader
way.

>The followings are major items I haven't touched:
>- how to treat cache visibility and invalidation coming from transactions 
>including DDL

On this point some of you gave me comments before, but at that time I had less
knowledge
and couldn't reply to them immediately. Sorry for that.

Right now I have hit upon two things.
Plan A is that all of the work is done in shared memory and no local cache
is used.
Plan B is that both a shared cache and a local cache are used.
Based on the discussion several months ago in this thread, plan B would
probably be better,
but there are some variations of plan B, so I'd like to hear opinions.

A. Use only shared memory
Because everything should be done inside shared memory, this needs the same
machinery as the current shared_buffers.
That is, handling transactions including DDL in a proper way needs MVCC, and
cleaning up obsoleted caches needs vacuum.
Taking advantage of MVCC and vacuum would work, but it seems to me pretty tough
to implement them.
So another option is plan B, which handles version control of the cache and
cleans it up in a different way.

B. Use both shared memory and local memory
The basic policy is that the shared memory keeps the latest version of the
cache as much
as possible, and each cache entry has version information (xmin, xmax).
The local cache is a kind of cache of the shared one, and its lifetime is
temporary.

[Search cache]
When a backend wants to use a relation or catalog cache entry in a transaction,
it tries to find it in the following order (sketched in code below):
1. local cache
2. shared cache
3. disk

At first there is no local cache, so the backend searches the shared cache and,
if found, loads the entry into local memory.
If the wanted entry is not found in shared memory, the backend fetches it
from disk.
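In pseudo-C the lookup order would be something like this (all function names
here are invented for illustration):

static CatCTup *
SearchCatCacheShared(CatCache *cache, Datum *keys)
{
    CatCTup *ct = LocalCacheLookup(cache, keys);    /* 1. local cache */

    if (ct == NULL)
    {
        ct = SharedCacheLookup(cache, keys);        /* 2. shared cache */
        if (ct != NULL)
            ct = CopyToLocalCache(ct);              /* keep a local copy */
        else
            ct = LoadFromCatalog(cache, keys);      /* 3. disk; also fills caches */
    }
    return ct;
}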

[Lifetime of local cache]
When ALTER TABLE/DROP TABLE is issued in a transaction, the relevant local
cache should differ from the original one.
On this point I'm thinking of two cases.
B-1: Create a local cache at the first reference and keep it until the
transaction ends.
 The relevant local cache is updated or deleted when DROP/ALTER is issued.
It's freed when the transaction is committed or aborted.
B-2: The lifetime of the local cache is one snapshot. If the isolation level is
read committed, the local cache is deleted every time a command is issued.

In case of B-1, sinval message machinery is necessary to update the local
cache, which is the same as the current machinery.
On the other hand, case B-2 doesn't need sinval messages, because after the
snapshot duration expires the local cache is deleted.
From another point of view, there is a trade-off between B-1 and B-2: B-1
would outweigh B-2 in terms of performance,
but B-2 would use less memory.

[Invalidation of shared cache]
I'm thinking that invalidating the shared cache can be the responsibility of a
backend that wants to see the latest version, rather than
the one that has committed the DROP/ALTER command. In my sketch, caches have
their own version information, so a transaction can compare its snapshot
with the shared cache version, and if the cached entry is not the wanted one,
we can obtain it from disk.

Do you have any thoughts?

[1] 
https://www.postgresql.org/message-id/flat/4E72940DA2BF16479384A86D54D0988A6F1EE452%40G01JPEXMBKW04
 
Regards,
Takeshi Ideriha



RE: Protect syscache from bloating with negative cache entries

2018-11-15 Thread Ideriha, Takeshi
Hello, thank you for updating the patch.


>From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
>At Thu, 4 Oct 2018 04:27:04 +0000, "Ideriha, Takeshi"
> wrote in
><4E72940DA2BF16479384A86D54D0988A6F1BCB6F@G01JPEXMBKW04>
>> >As a *PoC*, in the attached patch (which applies to current master),
>> >size of CTups are counted as the catcache size.
>> >
>> >It also provides pg_catcache_size system view just to give a rough
>> >idea of how such view looks. I'll consider more on that but do you have any 
>> >opinion
>on this?
>> >
>...
>> Great! I like this view.
>> One of the extreme idea would be adding all the members printed by
>> CatCachePrintStats(), which is only enabled with -DCATCACHE_STATS at this
>moment.
>> All of the members seems too much for customers who tries to change
>> the cache limit size But it may be some of the members are useful
>> because for example cc_hits would indicate that current cache limit size is 
>> too small.
>
>The attached introduces four features below. (But the features on relcache and
>plancache are omitted).
I haven't looked into the code yet, but I'm going to do it later.

Right now it seems to me that focusing on catalog cache invalidation and its
stats is a quick route
to committing this feature.

>1. syscache stats collector (in 0002)
>
>Records syscache status consists of the same columns above and "ageclass"
>information. We could somehow triggering a stats report with signal but we 
>don't want
>take/send/write the statistics in signal handler. Instead, it is turned on by 
>setting
>track_syscache_usage_interval to a positive number in milliseconds.

I agree. Ageclass is important for tweaking prune_min_age.
Collecting stats at every stats change would be heavy.

>2. pg_stat_syscache view.  (in 0002)
>
>This view shows catcache statistics. Statistics is taken only on the backends 
>where
>syscache tracking is active.
>
>>  pid  | application_name |relname |cache_name
>|   size   |ageclass | nentries
>>
>--+--++---+--
>+-+---
>>  9984 | psql | pg_statistic   | pg_statistic_relid_att_inh_index 
>>  |
>12676096 | {30,60,600,1200,1800,0} | {17660,17310,55870,0,0,0}
>
>Age class is the basis of catcache truncation mechanism and shows the 
>distribution
>based on elapsed time since last access. As I didn't came up an appropriate 
>way, it is
>represented as two arrays.  Ageclass stores maximum age for each class in 
>seconds.
>Nentries holds entry numbers correnponding to the same element in ageclass. In 
>the
>above example,
>
> age class  : # of entries in the cache
>   up to   30s  : 17660
>   up to   60s  : 17310
>   up to  600s  : 55870
>   up to 1200s  : 0
>   up to 1800s  : 0
>   more longer  : 0
>
> The ageclass is {0, 0.05, 0.1, 1, 2, 3}th multiples of  cache_prune_min_age 
> on the
>backend.

I just thought that the pair of ageclass and nentries could be represented as
JSON or a multi-dimensional array, but virtually they are all the same and
can be
converted to each other using some functions. So I'm not sure which
representation is the better one.

>3. non-transactional GUC setting (in 0003)
>
>It allows a GUC variable set by the action GUC_ACTION_NONXACT (the name
>requires consideration) to survive beyond rollback. It is required for
>remote guc setting to
>work sanely. Without the feature a remote-set value within a transaction will
>disappear
>involved in rollback. The only local interface for the NONXACT action is
>set_config(name, value, is_local=false, is_nonxact = true). 
>pg_set_backend_guc()
>below works on this feature.

TBH, I'm not familiar with this area and I may be missing something.
In order to change another backend's GUC value,
is ignoring transactional behavior always necessary? When the transaction of
the GUC setting
fails and is rolled back, if the error message is supposed to be reported, I
thought
just trying the transaction again would be enough.

>4. pg_set_backend_guc() function.
>
>Of course syscache statistics recording consumes significant amount of time so 
>it
>cannot be turned on usually. On the other hand since this feature is turned on 
>by GUC,
>it is needed to grab the active client connection to turn on/off the 
>feature(but we
>cannot). Instead, I provided a means to change GUC variables in another 
>backend.
>
>pg_set_backend_guc(pid, name, value) sets the GUC variable "name"
>on the backend "pid" to "value".
>
>
>
>With the above tools, w

RE: Copy data to DSA area

2018-11-13 Thread Ideriha, Takeshi

From: Thomas Munro [mailto:thomas.mu...@enterprisedb.com]
>
>On Tue, Nov 13, 2018 at 9:45 AM Robert Haas  wrote:
>> On Thu, Nov 8, 2018 at 9:05 PM Thomas Munro
>>  wrote:
>> > * I had some ideas about some kind of "allocation rollback" interface:
>> > you begin an "allocation transaction", allocate a bunch of stuff
>> > (perhaps indirectly, by calling some API that makes query plans or
>> > whatever and is totally unaware of this stuff).  Then if there is an
>> > error, whatever was allocated so far is freed in the usual cleanup
>> > paths by a rollback that happens via the resource manager machinery.
>> > If you commit, then the allocation becomes permanent.  Then you only
>> > commit stuff that you promise not to leak (perhaps stuff that has
>> > been added to a very carefully managed cluster-wide plan cache).  I
>> > am not sure of the details, and this might be crazy...
>>
>> Hmm, my first thought was that you were just reinventing memory
>> contexts, but it's really not quite the same thing, because you want
>> the allocations to end up owned by a long-lived context when you
>> succeed but a transient context when you fail.  Still, if it weren't
>> for the fact that the memory context interface is hostile to dynamic
>> shared memory's map-this-anywhere semantics, I suspect we'd try to
>> find a way to make memory contexts fit the bill, maybe by reparenting
>> contexts or even individual allocations.  You could imagine using the
>> sorts of union algorithms that are described in
>> https://en.wikipedia.org/wiki/Disjoint-set_data_structure to get very
>> low asymptotic complexity here.
>
>Yeah, I was imagining something that still works with MemoryContexts.
>Interesting idea re: unions.  I haven't got as far as thinking about how you'd 
>actually
>make that work.  But I think we're both describing the same general kind of
>semantics; you need to be able to build stuff with automatic clean-up on 
>non-local exit,
>but also keep that stuff for longer once you decide you're ready.
>
>Anyway, avoiding all the hard questions about new kinds of foot gun for now, 
>here is a
>stupid hack that shows a DSA area inside the traditional fixed-address shared 
>memory
>region, wrapped in a custom (and somewhat defective for now) MemoryContext.  It
>shows a "List"
>being manipulated by standard PostgreSQL list code that is entirely unaware 
>that it is
>working in shared memory:
>
>postgres=# call hoge_list(3);
>NOTICE:  Contents of list:
>NOTICE:   1
>NOTICE:   2
>NOTICE:   3
>CALL
>
>You can call that procedure from various different backends to add and remove
>integers from a List.  The point is that it could just as easily be a query 
>plan or any
>other palloc-based stuff in our tree, and the pointers work in any backend.

Thanks for the patch. I know it's a rough sketch but basically that's what I 
want to do. 
I tried it and it works well.

Regards,
Takeshi Ideriha


RE: Copy data to DSA area

2018-11-13 Thread Ideriha, Takeshi
Thank you for the comment.

From: Thomas Munro [mailto:thomas.mu...@enterprisedb.com]
>> I'm thinking to go with plan 1. No need to think about address
>> translation seems tempting. Plan 2 (as well as plan 3) looks a big project.
>
>The existing function dsa_create_in_place() interface was intended to support 
>that,
>but has never been used in that way so I'm not sure what extra problems will 
>come up.
>Here are some assorted thoughts:
>
>* You can prevent a DSA area from creating extra DSM segments, so that it is
>constrained to stay entirely in the space you give it, by calling 
>dsa_set_size_limit(area,
>size) using the same size that you gave to dsa_create_in_place(); now you have 
>a
>DSA area that manages a single fixed-sized chunk of memory that you gave it, 
>in your
>case inside the traditional shared memory segment (but it could be anywhere,
>including inside a DSM segment or another DSA area!)
Yeah, I will use it.
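A minimal sketch of the setup I'm imagining (AREA_SIZE and my_tranche_id are
placeholders):

void     *place = ShmemAlloc(AREA_SIZE);
dsa_area *area  = dsa_create_in_place(place, AREA_SIZE,
                                      my_tranche_id,
                                      NULL);    /* not backed by a DSM segment */
dsa_set_size_limit(area, AREA_SIZE);    /* never create extra DSM segments */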

>* You can probably write a MemoryContext wrapper for it, if it has only one 
>segment
>that is in the traditional shared memory segment.
>You would need to do very simple kind of address translation: the result from 
>palloc()
>needs to be base + dsa_allocate()'s result, and the argument to pfree() needs 
>to be
>subtracted from base when
>dsa_free() is called.  That is a version of your idea C that should work AFAIK.

I didn't notice that if only one segment is used, dsa_get_address() is not
needed and
simple math is enough (sketched below).
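So the wrapper's alloc/free could be as simple as this (a sketch; "area" and
"base" are globals set up at startup, both names made up):

static void *
shm_context_alloc(Size size)
{
    dsa_pointer dp = dsa_allocate(area, size);

    return (char *) base + dp;      /* native pointer for palloc() callers */
}

static void
shm_context_free(void *ptr)
{
    dsa_free(area, (dsa_pointer) ((char *) ptr - (char *) base));
}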

>* Once you have that working, you now have a new kind of resource management
>problem on your hands: memory leaks will be cluster-wide and cluster-life-time!
>That's hard, because the goal is to be able to use arbitrary code in the tree 
>that deals
>with plans etc, but that code all assumes that it can "throw" (elog()) on 
>errors.
>PostgreSQL C is generally "garbage collected" (in a way), but in this sketch, 
>that
>doesn't work anymore: this area *never* goes out of scope and gets cleaned up.
>Generally, languages with exceptions either need garbage collection or scoped
>destructors to clean up the mess, but in this sketch we don't have that 
>anymore...
>much like allocating stuff in TopMemoryContext, except worse because it 
>doesn't go
>away when one backend exits.
>
>* I had some ideas about some kind of "allocation rollback" interface:
>you begin an "allocation transaction", allocate a bunch of stuff (perhaps 
>indirectly, by
>calling some API that makes query plans or whatever and is totally unaware of 
>this
>stuff).  Then if there is an error, whatever was allocated so far is freed in 
>the usual
>cleanup paths by a rollback that happens via the resource manager machinery.
>If you commit, then the allocation becomes permanent.  Then you only commit 
>stuff
>that you promise not to leak (perhaps stuff that has been added to a very 
>carefully
>managed cluster-wide plan cache).  I am not sure of the details, and this 
>might be
>crazy...


Can I check my understanding?
The situation you are talking about is the following:
Data structure A and B will be allocated on DSA. Data A has a pointer to B.
Another data X is already allocated on some area (might be on local heap) and 
has a pointer variable X->a,
which will be linked to A.

 old_context = MemoryContextSwitchTo(dsa_memory_context);
 A = palloc();
 B = palloc();
 A->b = B; 
 X->a = A;
 MemoryContextSwitchTo(old_context);
 
If an error happens in the middle of this flow, the palloc'd data (A & B)
should be freed
and the pointer to the freed data (X->a) should be set back to its original
value.
So handling these cases introduces a "transaction" API like begin_allocate()
and end_allocate().

I'm thinking begin_allocate() starts keeping a record of palloc'd data
until end_allocate() is done. If an error occurs, we just pfree() that data.
However, to roll back pointers we need to
take note of the old value of the pointer. This would introduce a new API like
"points_to_pallocd_data(pointer, new_data)" to remember the old value and do
`X->a = A` (usage sketched below).
To implement them I still need further consideration about how to fit into or
extend the
existing MemoryContext machinery.
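The intended flow would look like this (every name here is a not-yet-existing
API, purely illustrative):

begin_allocate(dsa_context);        /* start recording palloc'd chunks */
A = palloc(sizeof(*A));
B = palloc(sizeof(*B));
A->b = B;
points_to_pallocd_data((void **) &X->a, A);  /* saves old X->a, then sets it */
end_allocate();                     /* commit: discard the undo record */
/* on error before end_allocate(): A and B are pfree'd, old X->a restored */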

Regards,
Takeshi Ideriha


RE: Copy data to DSA area

2018-11-08 Thread Ideriha, Takeshi
Hi, thank you for all the comments.
They're really helpful.

>From: Thomas Munro [mailto:thomas.mu...@enterprisedb.com]
>Sent: Wednesday, November 7, 2018 1:35 PM
>
>On Wed, Nov 7, 2018 at 3:34 PM Ideriha, Takeshi 
>
>wrote:
>> Related to my development (putting relcache and catcache onto shared
>> memory)[1],
>>
>> I have some questions about how to copy variables into shared memory, 
>> especially
>DSA area, and its implementation way.
>>
>> Under the current architecture when initializing some data, we
>> sometimes copy certain data using some specified functions
>>
>> such as CreateTupleDescCopyConstr(), datumCopy(), pstrdup() and so on.
>> These copy functions calls palloc() and allocates the
>>
>> copied data into current memory context.
>
>Yeah, I faced this problem in typcache.c, and you can see the function
>share_typledesc() which copies TupleDesc objects into a DSA area.
>This doesn't really address the fundamental problem here though... see below.

I checked share_tupledesc(). My original motivation came from copying a
tupledesc
with constraints, like CreateTupleDescCopyConstr() does.
But yes, as you stated here and below, without having to copy
tupledesc constraints, this method makes sense.

>
>> But on the shared memory, palloc() cannot be used. Now I'm trying to
>> use DSA (and dshash) to handle data on the shared memory
>>
>> so for example dsa_allocate() is needed instead of palloc(). I hit upon 
>> three ways to
>implementation.
>>
>> A. Copy existing functions and write equivalent DSA version copy
>> functions like CreateDSATupleDescCopyConstr(),
>>
>>datumCopyDSA(), dsa_strdup()
>>
>>In these functions the logic is same as current one but would be replaced 
>> palloc()
>with dsa_allocate().
>>
>>But this way looks too straight forward and code becomes less readable and
>maintainable.
>>
>> B. Using current functions and copy data on local memory context temporarily 
>> and
>copy it again to DSA area.
>>
>>This method looks better compared to the plan A because we don't need to 
>> write
>clone functions with copy-paste.
>>
>>But copying twice would degrade the performance.
>
>It's nice when you can construct objects directly at an address supplied by 
>the caller.
>In other words, if allocation and object initialization are two separate 
>steps, you can
>put the object anywhere you like without copying.  That could be on the stack, 
>in an
>array, inside another object, in regular heap memory, in traditional shared 
>memory, in
>a DSM segment or in DSA memory.  I asked for an alloc/init separation in the 
>Bloom
>filter code for the same reason.  But this still isn't the real problem here...

Yes, actually I tried to create a new function TupleDescCopyConstr(), which is 
almost the same as TupleDescCopy() except that it also copies constraints. This 
was supposed to separate allocation 
and initialization. But as you pointed out below, I had to manage an object graph 
with pointers 
and ran into exactly that difficulty.
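
For the flat part of the problem, the split that exists today already works; 
here is a minimal sketch of building a TupleDesc directly in DSA with 
TupleDescSize()/TupleDescCopy() (essentially what share_tupledesc() does; 
constraints and defaults are not copied this way):

    dsa_pointer dp;
    TupleDesc   shared_td;

    /* allocation and initialization as two separate steps */
    dp = dsa_allocate(area, TupleDescSize(src));
    shared_td = (TupleDesc) dsa_get_address(area, dp);
    TupleDescCopy(shared_td, src);  /* flat copy only; no constr/defaults */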

>> C. Enhance the features of palloc() and MemoryContext.
>>
>>This is a rough idea, but, for instance, make a new global flag to tell
>>palloc() to use a DSA area instead of the normal MemoryContext.
>>
>>MemoryContextSwitchToDSA(dsa_area *area) tells the following palloc() calls
>>to allocate memory in DSA.
>>
>>And MemoryContextSwitchBack(dsa_area) indicates that palloc() is used as the
>>normal one again.
>>
>>MemoryContextSwitchToDSA(area);
>>
>>palloc(size);
>>
>>MemoryContextSwitchBack(area);
>>
>> Plan C seems a handy way for DSA user because people can take advantage of
>existing functions.
>
>The problem with plan C is that palloc() has to return a native pointer, but 
>in order to
>do anything useful with this memory (ie to share it) you also need to get the
>dsa_pointer, but the palloc() interface doesn't allow for that.  Any object 
>that holds a
>pointer allocated with DSA-hiding-behind-palloc() will be useless for another 
>process.

Agreed. I hadn't given much consideration to this point.
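
A small sketch of the consequence, with a hypothetical struct: anything that 
lives in the shared area has to store dsa_pointer values, which each process 
translates in its own address space:

    typedef struct SharedEntry
    {
        dsa_pointer name;   /* dsa_allocate'd string, usable in any process */
        /* a raw "char *name" would be valid only in the allocating process */
    } SharedEntry;

    /* each reader maps the dsa_pointer locally */
    char *name = (char *) dsa_get_address(area, entry->name);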

>> What do you think about these ideas?
>
>The real problem is object graphs with pointers.  I solved that problem for 
>TupleDesc
>by making them *almost* entirely flat, in commit c6293249.  I say 'almost' 
>because it
>won't work for constraints or defaults, but that didn't matter for the 
>typcache.c case
>because it doesn't use those.  In other words I danced carefully around the 
>edge of
>the problem.
>
>In theory, object graphs, trees, lists etc could be implemented in a way that 
>allows for

Copy data to DSA area

2018-11-06 Thread Ideriha, Takeshi
Hi,

Related to my development (putting relcache and catcache onto shared memory)[1],
I have some questions about how to copy variables into shared memory, 
especially DSA area, and its implementation way.

Under the current architecture, when initializing some data, we sometimes copy 
certain data using specialized functions
such as CreateTupleDescCopyConstr(), datumCopy(), pstrdup() and so on. These 
copy functions call palloc() and allocate the
copied data in the current memory context.

But in shared memory, palloc() cannot be used. Now I'm trying to use DSA 
(and dshash) to handle data in shared memory,
so for example dsa_allocate() is needed instead of palloc(). I hit upon three 
ways to implement this.

A. Copy existing functions and write equivalent DSA-version copy functions like 
CreateDSATupleDescCopyConstr(),
   datumCopyDSA(), dsa_strdup()
   In these functions the logic is the same as the current one, but palloc() 
would be replaced with dsa_allocate().
   But this way looks too straightforward, and the code becomes less readable 
and maintainable.

B. Using current functions, copy data into a local memory context temporarily 
and copy it again to the DSA area.
   This method looks better than plan A because we don't need to write clone 
functions with copy-paste.
   But copying twice would degrade performance.

C. Enhance the features of palloc() and MemoryContext.
   This is a rough idea, but, for instance, make a new global flag to tell 
palloc() to use a DSA area instead of the normal MemoryContext.
   MemoryContextSwitchToDSA(dsa_area *area) tells the following palloc() calls 
to allocate memory in DSA.
   And MemoryContextSwitchBack(dsa_area) indicates that palloc() is used as the 
normal one again.

   MemoryContextSwitchToDSA(area);

   palloc(size);

   MemoryContextSwitchBack(area);

Plan C seems a handy way for DSA users because people can take advantage of 
existing functions.

What do you think about these ideas?


[1]
https://www.postgresql.org/message-id/4E72940DA2BF16479384A86D54D0988A567B9245%40G01JPEXMBKW04


Regards,
Takeshi Ideriha




RE: Number of buckets/partitions of dshash

2018-11-06 Thread Ideriha, Takeshi
>From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]

>> This would cause some waste of memory on DSA because some partitions
>> (buckets) are allocated but never used.
>>
>> So I'm thinking that the current dshash design is still OK, but a flexible
>> partition count would be appropriate for some use cases like mine.
>>
>> Do you have any thoughts?
>
>We could do that easily, but shouldn't we find a way to reduce or eliminate
>the impact of locking first? dshash needs to hold the partition lock while the
>caller is examining a returned entry.

Thanks for the comment. 
I agree.
It may take a long time to achieve, but as you've pointed out,
finding a way to minimize the locking time would benefit everyone, and that 
should be the first priority.
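
For reference, this is the usage pattern that makes the partition lock visible 
to callers (a sketch; MyEntry stands in for whatever the table stores):

    /* dshash_find() returns with the entry's partition lock held */
    MyEntry *entry = dshash_find(table, &key, false);   /* false = shared lock */

    if (entry != NULL)
    {
        /* ... examine the entry while the partition lock is held ... */
        dshash_release_lock(table, entry);
    }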

Regards,
Takeshi Ideriha




Number of buckets/partitions of dshash

2018-10-12 Thread Ideriha, Takeshi
Hi, let me clarify my understanding of the $title.

It seems that the number of hash partitions is fixed at 128 in dshash, and
right now we cannot change it unless the dshash.c code itself is changed, right?

According to a comment in dshash.c, DSHASH_NUM_PARTITIONS could become a runtime 
parameter in the future.
Is anyone working on that?
(I want to do it, but TBH I cannot promise it because of some other work.)

In my current development for allocating the catalog caches in shared memory[1],
I'm trying to put the hash table of each CatCache into shared memory using dshash.
The number of buckets for each CatCache is pre-defined by cacheinfo, and most of 
them are under 128, like 8 or 16.
This would cause some waste of memory on DSA because some partitions (buckets) 
are allocated but never used.

So I'm thinking that the current dshash design is still OK, but a flexible 
partition count would be appropriate
for some use cases like mine.
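
For illustration, a sketch of how such a table is created today, using the stock 
dshash_memcmp/dshash_memhash helpers (the tranche id and entry layout are 
hypothetical); the 128 partitions are implicit in dshash.c no matter how few 
entries are expected:

    typedef struct MyEntry
    {
        Oid         key;    /* hash key must be the first field */
        dsa_pointer data;   /* payload stored elsewhere in the DSA area */
    } MyEntry;

    static const dshash_parameters params = {
        sizeof(Oid),        /* key_size */
        sizeof(MyEntry),    /* entry_size */
        dshash_memcmp,      /* compare_function */
        dshash_memhash,     /* hash_function */
        MY_TRANCHE_ID       /* tranche_id (hypothetical) */
    };

    dshash_table *table = dshash_create(area, &params, NULL);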

Do you have any thoughts?

[1] 
https://www.postgresql.org/message-id/4E72940DA2BF16479384A86D54D0988A567B9245%40G01JPEXMBKW04

===
Takeshi Ideriha
Fujitsu Limited



RE: Protect syscache from bloating with negative cache entries

2018-10-03 Thread Ideriha, Takeshi
Hi, thank you for the explanation.

>From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
>>
>> Can I confirm my understanding of catcache pruning?
>> syscache_memory_target is the maximum figure per CatCache.
>> (Every CatCache has the same maximum.) So the total maximum size of
>> the catalog caches is estimated to be around, or slightly more than, the
>> length of the SysCache array times syscache_memory_target.
>
>Right.
>
>> If correct, I think writing down the above estimation in the
>> documentation would help DB administrators estimate memory usage.
>> The current description might lead to the misunderstanding that
>> syscache_memory_target is the total size of the catalog caches, in my impression.
>
>Honestly I'm not sure that is the right design. However, I don't think providing 
>such a
>formula to users helps them, since they don't know exactly how many CatCaches 
>and
>brothers live in their server, it is a soft limit, and in the end only a few or 
>just one catalog
>can reach the limit.

Yeah, I agree that that kind of formula is not suited for the documentation.
But if users don't know how many catcaches and brothers are used in postgres,
then how about making syscache_memory_target the total soft limit across all 
catcaches,
rather than a size limit per catcache? Internally, syscache_memory_target could 
be divided by the number of SysCaches to do its work. The total amount would be
easier to understand for users who don't know the detailed contents of the 
catalog caches.

Or, if users can tell how many and what kinds of catcaches exist, for instance 
by using 
the system view you provided in the previous email, the current design looks 
good to me.

>The current design is based on the assumption that we would have only one
>extremely-growable cache in one use case.
>
>> Related to the above, I just thought changing syscache_memory_target
>> per CatCache would make memory usage more efficient.
>
>We could easily have per-cache settings in CatCache, but how do we provide the 
>knobs
>for them? I can only think of too many possible solutions for that.
Agreed.

>> Though I haven't checked whether there's a case where each system catalog
>> cache's memory usage varies largely, the pg_class cache might need more memory 
>than others, and others might need less.
>> But it would be difficult for users to check each CatCache's memory
>> usage and tune it, because right now postgresql doesn't provide a handy way to
>check them.
>
>I supposed that this would be used without such a means. Someone who suffers 
>syscache bloat
>can just set this GUC to avoid the bloat. End.
Yeah, I had misunderstood the purpose.

>Apart from that, in the current patch, syscache_memory_target is not exact at 
>all in
>the first place, to avoid the overhead of counting the correct size. The major 
>difference comes
>from the size of the cache tuples themselves. But I have come to think that is 
>too much to omit.
>
>As a *PoC*, in the attached patch (which applies to current master), the size 
>of CTups is
>counted as the catcache size.
>
>It also provides a pg_catcache_size system view, just to give a rough idea of 
>how such a
>view looks. I'll consider it more, but do you have any opinion on this?
>
>
>=# select relid::regclass, indid::regclass, size from pg_syscache_sizes
>   order by size desc;
>    relid     |              indid               |  size
>--------------+----------------------------------+--------
> pg_class     | pg_class_oid_index               | 131072
> pg_class     | pg_class_relname_nsp_index       | 131072
> pg_cast      | pg_cast_source_target_index      |   5504
> pg_operator  | pg_operator_oprname_l_r_n_index  |   4096
> pg_statistic | pg_statistic_relid_att_inh_index |   2048
> pg_proc      | pg_proc_proname_args_nsp_index   |   2048
>..

Great! I like this view. 
One extreme idea would be to add all the members printed by CatCachePrintStats(), 
which is only enabled with -DCATCACHE_STATS at the moment. 
All of the members seem too much for customers who try to change the cache limit 
size,
but some of the members may be useful; for example, cc_hits would indicate that 
the current 
cache limit size is too small.

>> Another option is that users only specify the total memory target size,
>> and postgres dynamically changes each CatCache's memory target size according 
>> to a certain metric
>> (which still seems difficult and expensive to develop relative to the benefit).
>> What do you think about this?
>
>Given that few caches bloat at once, its effect is not so different from the 
>current
>design.
Yes agreed.

>> As you commented here, guc variable syscache_memory_target and
>> syscache_prune_min_age are used for both syscache and relcache (HTAB), 

RE: Global shared meta cache

2018-10-03 Thread Ideriha, Takeshi
Hi, 

Thank you for the discussion a while ago.
I'm afraid I haven't replied to everything.

To move this development forward, I've attached a PoC patch.

I introduced a GUC called shared_catcache_mem to specify 
how much memory is supposed to be allocated in the shared memory area.
It defaults to zero, which indicates that no catalog cache is shared
and caches are allocated in each backend's MemoryContext (same as current Postgres).

At this moment the patch only allocates the catalog cache header and CatCache 
data in the shared memory area.
It doesn't do much work yet, beyond starting and stopping the postgres server 
with shared_catcache_mem non-zero.

The shared version of CatCacheHdr is put in the postgres-initialized shared 
memory so that backends attach to it
and build SysCache[] to store pointers to each CatCache.
Each CatCache, CatCTup and CatCList is also allocated in the shared memory area,
whose size limit is the value of shared_catcache_mem, backed by DSA.
This area is first created in the postgres-initialized shared memory and 
re-initialized as a DSA area,
because the address of the postgres-initialized shared area does not change 
among different processes, 
which hopefully makes it easy to handle pointers in the shared memory.
(Though I'm still struggling to grasp the idea of DSA and the underlying DSM...)
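
A rough sketch of that initialization pattern, with hypothetical names for the 
lock tranche and area name (the actual patch may differ in detail):

    dsa_area *area;
    bool      found;
    void     *place;

    /* in shared-memory initialization (postmaster): */
    place = ShmemInitStruct("shared catcache area", shared_catcache_mem, &found);
    if (!found)
        area = dsa_create_in_place(place, shared_catcache_mem,
                                   LWTRANCHE_SHARED_CATCACHE, NULL);

    /* in each backend: */
    area = dsa_attach_in_place(place, NULL);
    dsa_pin_mapping(area);      /* keep it mapped for the backend's lifetime */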

The following are the major items I haven't touched yet:
- making the hash table of each CatCache shared, for which I'm going to take 
advantage of dshash
- how to evict shared cache entries (an LRU mechanism)
- how to treat cache visibility and invalidation coming from transactions, 
including DDL 
- how to alleviate the slowness compared to current PostgreSQL
- making the relcache shared, as well as the catcache

If you have any insights/reactions/suggestions, please feel free to comment.


Takeshi Ideriha
Fujitsu Limited




0001-PoC-Allocate-catcache-on-the-shared-memory.patch
Description: 0001-PoC-Allocate-catcache-on-the-shared-memory.patch


RE: Size and size_t in dsa API

2018-09-20 Thread Ideriha, Takeshi
>> As a non-expert developer's opinion, I think the mixing of Size and size_t
>> makes the source code difficult to understand.
>
>Agreed.  Let's change them all to size_t and back-patch that to keep future
>back-patching easy.  Patch attached.

Thank you for the quick action. I'm happy now.
I confirmed that the English word "Size" in comments is kept as it is.


Takeshi Ideriha
Fujitsu Limited


Size and size_t in dsa API

2018-09-20 Thread Ideriha, Takeshi
Hi, 

I'm trying to use the DSA API and am a little confused about the $subject.

Some return values and arguments are defined as size_t,
but others as Size.

Example: 
- dsa_area *dsa_create_in_place(void *place, size_t size,
int tranche_id, dsm_segment *segment)

- Size dsa_minimum_size(void)

I'm confused because Size and size_t are virtually equal, but 
they are mixed in dsa.c. At first I thought there was
some convention distinguishing them based on context.

But according to this thread, which seems to have gone nowhere,
they are the same. 
https://www.postgresql.org/message-id/CAEepm%3D1eA0vsgA7-2oigKzqg10YeXoPWiS-fCuQRDLwwmgMXag%40mail.gmail.com
 
Is anyone still working on this?
As a non-expert developer's opinion, I think the mixing of Size and size_t makes 
the source code difficult to understand.

===
Takeshi Ideriha
Fujitsu Limited






RE: Protect syscache from bloating with negative cache entries

2018-09-11 Thread Ideriha, Takeshi
Hi, 

>Subject: Re: Protect syscache from bloating with negative cache entries
>
>Hello. The previous v4 patchset was just broken.

>Somehow the 0004 was merged into the 0003, and applying 0004 results in 
>failure. I
>removed the 0004 part from the 0003, rebased, and reposted it.

I have some questions about syscache and relcache pruning,
though they may have been discussed upthread or be beside the point.

Can I confirm my understanding of catcache pruning?
syscache_memory_target is the maximum figure per CatCache.
(Every CatCache has the same maximum.)
So the total maximum size of the catalog caches is estimated to be around, or 
slightly more than, the length of the SysCache array times syscache_memory_target.

If that's correct, I think writing down the above estimation in the documentation 
would help DB administrators estimate memory usage.
The current description might lead to the misunderstanding that 
syscache_memory_target
is the total size of the catalog caches, in my impression.

Related to the above, I just thought that changing syscache_memory_target per 
CatCache
would make memory usage more efficient.
Though I haven't checked whether there's a case where each system catalog cache's 
memory usage varies largely,
the pg_class cache might need more memory than others, and others might need less.
But it would be difficult for users to check each CatCache's memory usage and 
tune it,
because right now postgresql doesn't provide a handy way to check them.
Another option is that users only specify the total memory target size, and 
postgres 
dynamically changes each CatCache's memory target size according to a certain 
metric
(which still seems difficult and expensive to develop relative to the benefit).
What do you think about this?

+   /*
+    * Set up pruning.
+    *
+    * We have two knobs to control pruning and a hash can share them with
+    * syscache.
+    */
+   if (flags & HASH_PRUNABLE)
+   {
+       hctl->prunable = true;
+       hctl->prune_cb = info->prune_cb;
+       if (info->memory_target)
+           hctl->memory_target = info->memory_target;
+       else
+           hctl->memory_target = &syscache_memory_target;
+       if (info->prune_min_age)
+           hctl->prune_min_age = info->prune_min_age;
+       else
+           hctl->prune_min_age = &syscache_prune_min_age;
+   }
+   else
+       hctl->prunable = false;

As you commented here, the GUC variables syscache_memory_target and
syscache_prune_min_age are used for both the syscache and the relcache (HTAB), 
right?

Do the syscache and relcache have similar amounts of memory usage?
If not, I think introducing a separate GUC variable for the relcache would be 
fine,
and likewise for syscache_prune_min_age.

Regards,

Takeshi Ideriha
Fujitsu Limited





RE: Global shared meta cache

2018-07-13 Thread Ideriha, Takeshi
Hi, Konstantin

>Hi,
>I really think that we need to move to global caches (and especially catalog 
>caches) in
>Postgres.
>Modern NUMA servers may have hundreds of cores and to be able to utilize all 
>of them,
>we may need to start a large number (hundreds) of backends.
>The memory overhead of the local cache multiplied by 1000 can be quite significant.

Yeah, thank you for the comment.


>I am quite skeptical concerning the performance results you have provided.
>Once the dataset completely fits in memory (which is true in your case), 
>select-only
>pgbench with prepared statements should be about two times faster than without
>prepared statements. And in your case performance with prepared statements is 
>even
>worse.
>
>I wonder if you have repeated each measurement multiple times, to make sure 
>that it
>is not just a fluctuation.
>Also, which postgresql configuration have you used? If it is the default 
>postgresql.conf with
>a 128MB shared buffers size, then you are measuring the time of disk access, 
>and the catalog
>cache is not relevant for performance in this case.
>
>Below are the results I got with pgbench scale 100 (with scale 10 the results 
>are slightly better)
>at my desktop with just 16GB of RAM and 4 cores:
>
>                                | master branch | prototype | proto/master (%)
>--------------------------------+---------------+-----------+-----------------
>pgbench -c10 -T60 -Msimple -S   | 187189        | 182123    | 97%
>pgbench -c10 -T60 -Msimple      | 15495         | 15112     | 97%
>pgbench -c10 -T60 -Mprepared -S | 98273         | 92810     | 94%
>pgbench -c10 -T60 -Mprepared    | 25796         | 25169     | 97%
>
>As you see, there are no surprises here: the negative effect of the shared 
>cache is largest
>for the case of non-prepared selects (because selects themselves are much 
>faster
>than updates, and during compilation we have to access relations multiple 
>times).
>
>

As you pointed out, my shared_buffers and scaling factor were too small.
I ran the benchmark again with new settings, and my results seem to reproduce 
yours.

On a machine with 128GB of memory and 16 cores, shared_buffers was set to 32GB and
the DB was initialized with -s100.

TPS results follow (mean of 10 measurements, rounded to integers):

                                       | master branch | proto  | proto/master (%)
  -------------------------------------+---------------+--------+-----------------
  pgbench -c48 -T60 -j16 -Msimple -S   | 122140        | 114103 | 93
  pgbench -c48 -T60 -j16 -Msimple      | 7858          | 7822   | 100
  pgbench -c48 -T60 -j16 -Mprepared -S | 221740        | 210778 | 95
  pgbench -c48 -T60 -j16 -Mprepared    | 9257          | 8998   | 97
As you mentioned, SELECT-only queries have more overhead.

(By the way, I think in a later email you mentioned results when the number of 
concurrent clients is larger.
 I'll also try to check that.)


Takeshi Ideriha
Fujitsu Limited




RE: ON CONFLICT DO NOTHING on pg_dump

2018-07-12 Thread Ideriha, Takeshi
>I noticed one more thing: pg_dumpall.c doesn't really need to prohibit
>--on-conflict-do-nothing without --insert.  Its existing validation rejects 
>illegal
>combinations of the settings that are *not* passed on to pg_dump.  It seems OK 
>to
>just pass those on and let pg_dump complain.  For example, if you say 
>"pg_dumpall
>--data-only --schema-only", it's pg_dump that complains, not pg_dumpall.  I 
>think we
>should do the same thing here.

Thank you for the clarification. I hadn't considered that pg_dumpall internally 
runs pg_dump.

>Pushed, with those changes.
Thanks!


RE: ON CONFLICT DO NOTHING on pg_dump

2018-07-12 Thread Ideriha, Takeshi
Hi, thanks for the revision.

>
>+Add ON CONFLICT DO NOTHING clause in the INSERT commands.
>
>I think this would be better as: Add ON CONFLICT DO NOTHING to
>INSERT commands.

Agreed.

>+printf(_("  --on-conflict-do-nothing dump data as INSERT
>commands with ON CONFLICT DO NOTHING \n"));
>
>That's slightly misleading... let's just use the same wording again, eg "add ON
>CONFLICT DO NOTHING to INSERT commands".

Agreed. But you forgot to fix it in pg_dump.c,
so could you please fix that and commit it?

Regards,
Takeshi Ideriha



RE: ON CONFLICT DO NOTHING on pg_dump

2018-07-10 Thread Ideriha, Takeshi
Hi,

>   The new structure member appears out of place, can you move up along
>   with other "command-line long options" ?
>
>
>
>Done
>

I ran the regression tests (make check-world) and 
checked manually that pg_dump --on-conflict-do-nothing works properly.
The code also looks fine to me.
This feature benefits some users, with only a small code change.

So I marked it as 'Ready for Committer'.
I'd like to wait and see committers' opinions.

Regards,
Takeshi Ideriha



RE: Global shared meta cache

2018-07-05 Thread Ideriha, Takeshi
>-Original Message-
>From: se...@rielau.com [mailto:se...@rielau.com]
>Sent: Wednesday, June 27, 2018 2:04 AM
>To: Ideriha, Takeshi/出利葉 健 ; pgsql-hackers
>
>Subject: RE: Global shared meta cache
>
>Takeshi-san,
>
>
>>My customer created hundreds of thousands of partition tables and tried
>>to select data from hundreds of applications, which resulted in
>>enormous consumption of memory because it consumed # of backends multiplied by
># of local memory (e.g. 100 backends x 1GB = 100GB).
>>Relation caches are loaded into each backend's local memory.
>My team and I have been working to make caches shared for the past two years, 
>but
>we have chosen not to share the system and rel caches.
>The reason is that these caches play a big role in transactional DDL processing.
>When you do DDL, your backend can see all the changes, since you update your own
>cache, but not anyone else's until you commit.
>You will find that dealing with that is the true complexity.

Hi Serge,

Thank you for sharing your experience.
I hadn't thought much about the DDL visibility problem. 
Introducing MVCC-like version control to the catcache might solve the problem, 
but it 
seems too heavyweight to me.
It may be good to keep a local catcache for in-progress transactions rather than 
sharing everything. 
(Other hackers have pointed this out as well.)

>Have you tried to simply cap the size of these caches?
>That's a rather straight forward piece of work and will get you quite far.
>We run with a 20MB syscache and a 10MB relcache with 100k+ objects and hundreds
>of backends. A dumb LRU is plenty good for the purpose.
>

I haven't tried it yet, but I read some relevant discussion:
 
https://www.postgresql.org/message-id/flat/20161219.201505.11562604.horiguchi.kyot...@lab.ntt.co.jp

I think the cap solution alleviates memory bloat in some cases, but there is 
still a problematic case when there are very many backends.

>That being said I would love to see these caches shared. :-)
Thank you!

Regards,
Takeshi


RE: Global shared meta cache

2018-07-04 Thread Ideriha, Takeshi
>-Original Message-
>From: AJG [mailto:ay...@gera.co.nz]
>Sent: Wednesday, June 27, 2018 3:21 AM
>To: pgsql-hack...@postgresql.org
>Subject: Re: Global shared meta cache
>
>Ideriha, Takeshi wrote
>> 2) benchmarked 3 times for each condition and got the average TPS.
>>
>>                                 | master branch | prototype | proto/master (%)
>> --------------------------------+---------------+-----------+-----------------
>> pgbench -c48 -T60 -Msimple -S   | 131297        | 130541    | 101%
>> pgbench -c48 -T60 -Msimple      | 4956          | 4965      | 95%
>> pgbench -c48 -T60 -Mprepared -S | 129688        | 132538    | 97%
>> pgbench -c48 -T60 -Mprepared    | 5113          | 4615      | 84%
>>
>>
>> 001_global_meta_cache.patch (6K)
>> http://www.postgresql-archive.org/attachment/6026686/0/001_global_
>> meta_cache.patch
>
>
>Hello,
>Apologies for the question. I thought I would just double-check the percentages 
>that have
>been presented.
>Is the percentage calculation correct?
>#1 and #3 look inverted to me (lower when they should be higher and vice versa),
>and
>#2 and #4 look generally incorrect (the percentages look much larger than they 
>should be
>based on the numbers).
>
>I.e. for Msimple -S the prototype had slightly worse tps performance (130541) 
>versus
>master (131297). I would expect the percentage to be e.g. 99%, not 101%.
>
>But I may be misunderstanding something :)
>
>Also, Msimple is 4956 master versus 4965 prototype. Just a 9 tps change, a very 
>slight
>improvement in tps, but the percentage provided is 95%. I would expect it to be 
>just
>over 100%?
>Again, maybe I'm not understanding, and hoping it is just my error :)
>
>
>
>--
>Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
>
Hi, 
Thank you for the comments, and sorry for the late reply.
Thanks to you, I noticed I had made a mistake:
as you pointed out, my calculation was wrong.

I also need to change some settings in postgresql.conf and pgbench, 
so I'm going to measure performance again and submit the results.

Regards,
Takeshi Ideriha




RE: libpq example doesn't work

2018-07-03 Thread Ideriha, Takeshi



>-Original Message-
>From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com]
>Sent: Sunday, July 1, 2018 9:17 PM
>To: Ideriha, Takeshi/出利葉 健 ; pgsql-hackers
>
>Subject: Re: libpq example doesn't work
>
>On 27.06.18 08:29, Ideriha, Takeshi wrote:
>> When I tried to use libpq, I found that the libpq example[1] does not work.
>> That's because SELECT pg_catalog.set_config() never returns
>> PGRES_COMMAND_OK, which is expected, but actually returns PGRES_TUPLES_OK.
>
>Fixed.  There were additional similar cases that I fixed also.
>
>--
>Peter Eisentraut  http://www.2ndQuadrant.com/
>PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>

Thank you for the commit. I totally missed the large object examples.

Regards,
Takeshi Ideriha




libpq example doesn't work

2018-06-27 Thread Ideriha, Takeshi
Hi, 

When I tried to use libpq, I found that the libpq example[1] does not work.
That's because SELECT pg_catalog.set_config() never returns PGRES_COMMAND_OK, 
which is what the example expects;
it actually returns PGRES_TUPLES_OK.
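
In other words, the check in the example needs to test for the tuple result 
(a sketch along the lines of testlibpq.c):

    /* set_config() is called via SELECT, so a successful call returns tuples */
    res = PQexec(conn, "SELECT pg_catalog.set_config('search_path', '', false)");
    if (PQresultStatus(res) != PGRES_TUPLES_OK)     /* was PGRES_COMMAND_OK */
    {
        fprintf(stderr, "SET failed: %s", PQerrorMessage(conn));
        PQclear(res);
        exit_nicely(conn);
    }
    PQclear(res);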

Patch attached. 
I changed both src/test/example and document.

[1] https://www.postgresql.org/docs/devel/static/libpq-example.html 

(By the way, I'm wondering why these examples are placed under src/test.
 They are just examples and not regression tests.)

Best regards,
Takeshi Ideriha



0001-libpq-example-fix.patch
Description: 0001-libpq-example-fix.patch


RE: ON CONFLICT DO NOTHING on pg_dump

2018-06-26 Thread Ideriha, Takeshi
>> I agree with you though supporting MERGE or ON-CONFLICT-DO-UPDATE seems
>hard work.
>> Only ON-CONCLICT-DO-NOTHING use case may be narrow.
>
>Is it narrow, or is it just easy enough to add quickly?

Sorry for the late reply.
I read your comment and rethought it.
What I meant by "narrow" is that the number of people who would use this new 
feature 
seems limited to me.
And I had the wrong impression that small improvements are always rejected by 
hackers.
But that's only the case for small improvements with huge source code 
modifications; in short, cases where the cost/benefit ratio is low.

This patch has some benefit and the source code change is small,
so I'll just wait for other people's reviews.

>And by the way, you don't need MERGE.  You can just generate INSERT/
>UPDATE/DELETE statements -- MERGE is mainly an optimization on that, and could
>wait until PG has a MERGE.
Oh, thank you for clarifying this. Now I understand it.

Best regards,
Takeshi Ideriha





Global shared meta cache

2018-06-26 Thread Ideriha, Takeshi
Hi, hackers!

My customer created hundreds of thousands of partition tables and tried to 
select data from hundreds of applications,
which resulted in enormous memory consumption, because the total is the number 
of backends multiplied by the per-backend local cache size (e.g. 100 backends x 
1GB = 100GB).
Relation caches are loaded into each backend's local memory. 

To address this issue I'm trying to move the meta caches, i.e. the catcache and 
relcache, into shared memory.

This topic seems to have been discussed several times.
For instance this thread: 
https://www.postgresql.org/message-id/CA%2BTgmobjDw_SWsxyJwT9z-YOwWv0ietuQx5fb%3DWEYdDfvCbzGQ%40mail.gmail.com
 

In my understanding, it discussed moving the catcache and relcache from 
backend-local memory into shared memory,
and was most concerned with performance overhead.

Robert Haas wrote:
> I think it would be interested for somebody to build a prototype here
> that ignores all the problems but the first and uses some
> straightforward, relatively unoptimized locking strategy for the first
> problem. Then benchmark it. If the results show that the idea has
> legs, then we can try to figure out what a real implementation would
> look like.
> (One possible approach: use Thomas Munro's DHT stuff to build the shared 
> cache.)

I was inspired by this comment and am now developing a prototype (please see 
attached),
but I haven't yet put the cache structure into shared memory.
Instead, I put dummy data in shared memory, which is initialized at startup, 
and then acquire/release a lock just before/after searching for or creating a 
catcache entry.

I haven't considered the relcache or CatCList either.
It is difficult to do everything at once in the right direction, 
so I'm making a small prototype to check whether I'm walking on the proper 
path.
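
To sketch what "acquire/release a lock around searching/creating" means here 
(the lock name is hypothetical, and the prototype's actual granularity may 
differ):

    /* coarse-grained serialization of catcache lookup/creation */
    LWLockAcquire(SharedCatCacheLock, LW_SHARED);
    ct = SearchCatCacheInternal(cache, nkeys, v1, v2, v3, v4);
    LWLockRelease(SharedCatCacheLock);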

I tested pgbench to compare master branch with my patch. 

0) Environment
   - RHEL 7.4
   - 16 cores
   - 128 GB memory

1) Initialized with pgbench -i -s10

2) Benchmarked 3 times for each condition and took the average TPS.

                                    | master branch | prototype | proto/master (%)
   ---------------------------------+---------------+-----------+-----------------
   pgbench -c48 -T60 -Msimple -S    | 131297        | 130541    | 101%
   pgbench -c48 -T60 -Msimple       | 4956          | 4965      | 95%
   pgbench -c48 -T60 -Mprepared -S  | 129688        | 132538    | 97%
   pgbench -c48 -T60 -Mprepared     | 5113          | 4615      | 84%

  This result seems to show that, except for the prepared protocol with 
not-only-SELECT workloads, it didn't make much difference.
   

What do you think about it?
Before I dig deeper, I want to hear your thoughts.

Best regards,
Takeshi Ideriha



001_global_meta_cache.patch
Description: 001_global_meta_cache.patch


RE: ON CONFLICT DO NOTHING on pg_dump

2018-06-14 Thread Ideriha, Takeshi
Hi,

>-Original Message-
>From: Surafel Temesgen [mailto:surafel3...@gmail.com]
>thank you for the review
>
>   Do you have any plan to support on-conflict-do-update? Supporting this
>   seems to me complicated and would take much time, so I don't mind not
>   implementing it.
>
>
>I agree it's complicated, and I don't have a plan to implement it.

Sure.

>thank you for pointing that out. i added a basic test, and it seems to me the 
>rest of the testing
>is covered by the column_inserts test

Thank you for the update. 
Just one comment on the code.

+   if (dopt.do_nothing && !(dopt.dump_inserts || dopt.column_inserts))
I think you can remove dopt.column_inserts here, because at this point 
dopt.dump_inserts has
already been set to true if --column-inserts is specified. 

But I'm inclined to think that the use case for this feature is vague.
Could you give a more concrete use case that is practical for many users?
(It would draw more attention to this patch.)
For example, is it useful for a backup taken just before making a big change to 
the DB, like deleting tables?

===
Takeshi


RE: ON CONFLICT DO NOTHING on pg_dump

2018-06-14 Thread Ideriha, Takeshi
>-Original Message-
>From: Nico Williams [mailto:n...@cryptonector.com]
>On Tue, Jun 12, 2018 at 09:05:23AM +, Ideriha, Takeshi wrote:
>> >From: Surafel Temesgen [mailto:surafel3...@gmail.com]
>> >Subject: ON CONFLICT DO NOTHING on pg_dump
>>
>> >Sometimes I have to maintain two similar databases and I have to update one
>> >from the other, and having the option to add an ON CONFLICT DO NOTHING clause
>> >to the INSERT commands in the dump data allows pg_restore to be done free of
>> >ignorable errors.
>>
>> Hi,
>> I feel that on-conflict-do-nothing support is useful, especially coupled
>> with the --data-only option.
>> Only the difference in data can be restored.
>
>But that's additive-only.  Only missing rows are restored this way, and 
>differences are
>not addressed.
>
>If you want restore to restore data properly and concurrently (as opposed to 
>renaming
>a new database into place or whatever) then you'd want a) MERGE, b) dump to
>generate MERGE statements.  A concurrent data restore operation would be rather
>neat.

I agree with you, though supporting MERGE or ON-CONFLICT-DO-UPDATE seems like 
hard work.
The use case of ON-CONFLICT-DO-NOTHING alone may be narrow.

--
Takeshi 




RE: ON CONFLICT DO NOTHING on pg_dump

2018-06-12 Thread Ideriha, Takeshi
>From: Surafel Temesgen [mailto:surafel3...@gmail.com] 
>Subject: ON CONFLICT DO NOTHING on pg_dump

>Sometimes I have to maintain two similar databases and I have to update one 
>from the other, and having the option to add an ON CONFLICT DO NOTHING 
>clause to the INSERT commands in the dump data allows pg_restore to be done 
>free of ignorable errors.

Hi,
I feel that on-conflict-do-nothing support is useful, especially coupled 
with the --data-only option.
Only the difference in data can be restored.

>The attached patch adds an --on-conflect-do-nothing option to pg_dump in order 
>to do the above. 

The following are some comments.

+  --on-conflect-do-nothing
Here's a typo: conflect -> conflict. This typo also applies to pg_dump.c

printf(_("  --insertsdump data as INSERT commands, 
rather than COPY\n"));
+   printf(_("  --on-conflect-do-nothing dump data as INSERT commands 
with on conflect do nothing\n"));
printf(_("  --no-commentsdo not dump comments\n"));

The output of --help should be in alphabetical order according to the 
convention, 
so changing the order seems logical. 
Please apply this review comment to the documentation as well.
By the way, commit 4d6a854 conflicts with the patch on this point.

+This option is not valid unless --inserts is also 
specified.
+   
   
+   if (dopt.do_nothing && !dopt.dump_inserts)
+   exit_horribly(NULL, "option --on-conflect-do-nothing requires 
option --inserts\n");

How about mentioning --column-inserts? --on-conflict-do-nothing with 
--column-inserts should work.

Do you have any plan to support on-conflict-do-update? Supporting this seems to 
me complicated and would take much time, so I don't mind not implementing it.

What do you think about adding some test cases? 
command_fails_like() in 001_basic.pl checks the command-failure pattern for an 
invalid combination of options,
and 002_pg_dump.pl checks the feature itself.

Regards,
Takeshi Ideriha


RE: Update backend/lib/README

2018-05-23 Thread Ideriha, Takeshi
>Seems reasonable. Pushed, thanks!
>
>- Heikki
>
Thanks for the quick work!

Ideriha, Takeshi 




Update backend/lib/README

2018-05-22 Thread Ideriha, Takeshi
Hi, 

When I looked into dshash.c, I noticed that dshash.c, bipartite_match.c and 
knapsack.c are not mentioned in the README.
The other files in src/backend/lib are mentioned. I'm not sure whether this is 
intentional or just an oversight.

Does anyone have opinions?

Patch attached.
- adds a summary of the three files mentioned above (copied from each file's 
comments)
- changes the list to alphabetical order

===
Takeshi Ideriha
Fujitsu Limited




backend_lib_readme.patch
Description: backend_lib_readme.patch


RE: Postgres 11 release notes

2018-05-21 Thread Ideriha, Takeshi
>-Original Message-
>From: Bruce Momjian [mailto:br...@momjian.us]
>I have committed the first draft of the Postgres 11 release notes.  I will add 
>more
>markup soon.  You can view the most current version here:

Hi, there is a small typo.

I think "These function" should be "These functions".
(In the next sentence, "these functions" is used.) 

Regards,
Ideriha, Takeshi


release_note_typo.patch
Description: release_note_typo.patch


RE: log_min_messages shows debug instead of debug2

2018-05-17 Thread Ideriha, Takeshi
>-Original Message-
>From: Robert Haas [mailto:robertmh...@gmail.com]
>OK, I'm happy enough to commit it then, barring other objections.  I was just 
>going to
>just do that but then I realized we're in feature freeze right now, so I 
>suppose this
>should go into the next CommitFest.

Thank you both for the discussion.
Sure, I registered it in the next CommitFest as 'Ready for Committer'.

https://commitfest.postgresql.org/18/1638/ 

Regards,
Takeshi Ideriha


log_min_messages shows debug instead of debug2

2018-05-15 Thread Ideriha, Takeshi
Hi,

I noticed that if log_min_messages is set to 'debug2', SHOW displays 'debug' 
instead of 'debug2'.
Showing the other debug options, like debug1, works normally.
The same is true for client_min_messages.

According to commits a033daf56 and baaad2330, debug is an alias for debug2, kept 
for backward compatibility.
These debug options are mainly used by PostgreSQL hackers, so I think this is a 
trivial issue.

Patch attached.
With it, the output of 'show log_min_messages' becomes 'debug2' whether it's set 
to debug or debug2.
It passed make installcheck.

I just changed the order of server_message_level_options[],
because in this case the SHOW command linearly searches the GUC options via 
config_enum_lookup_by_value().
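
The reordering is along these lines (abbreviated sketch of guc.c; the point is 
that the hidden alias "debug" must come after "debug2", since 
config_enum_lookup_by_value() returns the first entry with a matching value):

    static const struct config_enum_entry server_message_level_options[] = {
        {"debug5", DEBUG5, false},
        {"debug4", DEBUG4, false},
        {"debug3", DEBUG3, false},
        {"debug2", DEBUG2, false},
        {"debug1", DEBUG1, false},
        {"debug", DEBUG2, true},    /* moved below "debug2" */
        /* ... log, info, notice, warning, error, fatal, panic ... */
        {NULL, 0, false}
    };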

What do you think about it?
If it's worth committing, do I need a test? I couldn't find a relevant test set.


Takeshi Ideriha
Fujitsu Limited



debug2_log_level.patch
Description: debug2_log_level.patch