Simon Riggs wrote
>> Can you CLUSTER
>> against a minmax index?
>
> Not in this release, at least in my understanding. It's not yet
> possible to do an ordered fetch, so the cluster scan probably won't
> work.
As per the patch I helped write, CLUSTER should use the
sequential heap scan+sort whe
Simon Riggs wrote
> From our discussions here, IMHO there is a strong case for avoiding
> btrees completely for larger historical data tables. That isn't
> something I had even considered as desirable before this conversation
> but ISTM now that taking that approach will be more fruitful than
> att
Jeremy Harris wrote
> Surely there's good correlation between IMSI & IMEI, so have a separate
> table to translate one to (a group of) the others, and
> halve the indexes on your main table?
Yes; unfortunately both are not always available; but it's something
we are thinking about (it requires log
Simon Riggs wrote
> So in the use case you describe, the min max index would require a
> scan of only 25% of the table, not the 80% described earlier for
> random inserts. In my experience, people wish to keep data for much
> longer periods and so the percentage of scan required would drop lower
>
Simon Riggs wrote
> On 5 November 2013 14:28, Leonardo Francalanci <m_lists@> wrote:
>
>> Either my sql is not correct (likely), or my understanding of the minmax
>> index is
>> not correct (even more likely), or the minmax index is not usab
Jeff Janes wrote
> Some experiments I did a few years ago showed that applying sorts to the
> data to be inserted could be helpful even when the sort batch size was as
> small as one tuple per 5 pages of existing index. Maybe even less.
Cool!!! Do you have any idea/hint on how I could try and rep
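Not the original experiment, but the effect is easy to model in a few lines: treat the btree leaf level as one page per key range and the buffer cache as a small LRU, then count page misses with and without sorting each insert batch (all the figures here are invented):

```python
import random

PAGE_CAP = 100      # keys per leaf page (toy figure)
CACHE_PAGES = 50    # pages our pretend buffer cache can hold

def misses(keys, batch=1):
    """Insert keys, sorting each `batch`-sized group first;
    count the page-cache misses the insertions cause."""
    cache, miss = [], 0            # LRU list, most recent last
    for i in range(0, len(keys), batch):
        for k in sorted(keys[i:i + batch]):
            page = k // PAGE_CAP   # leaf page this key lands on
            if page in cache:
                cache.remove(page)
            else:
                miss += 1
                if len(cache) >= CACHE_PAGES:
                    cache.pop(0)   # evict least recently used
            cache.append(page)
    return miss

random.seed(0)
keys = [random.randrange(100_000) for _ in range(20_000)]
print(misses(keys, batch=1), misses(keys, batch=500))
```

Sorting within a batch groups keys that land on the same leaf page, so each page is touched once per batch instead of being evicted and refetched between unrelated inserts.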
Claudio Freire wrote
> Well, of course, they're not magic pixie dust.
Of course they aren't. I think they can make a difference in a sequential
input scenario. But I'm not the one who said that they are fit to
solve the problems that other people and I are talking about in this thread.
Claudio Freire wrote
> real data isn't truly random
Well, let's try normal_rand???
create table t1 as select trunc(normal_rand(100, 50, 3)) as n,
generate_series(1, 100) as i;
with cte as
(select min(n) as minn, max(n) as maxn, i/100 from t1 group by i/100),
Claudio Freire wrote
> Min-max indexes always require a sequential scan of the min-max index
> itself when querying.
I'm worried about the number of heap pages that will be scanned.
My understanding is that given the random input, the index will
not be selective enough, and will end up requiring
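The worry can be made concrete with a toy model (uniform random values, an invented block size — not BRIN itself): with random input, nearly every per-block range covers any probe value, while sorted input narrows it to a block or two.

```python
import random

random.seed(1)
ROWS_PER_BLOCK = 100
values = [random.randrange(10_000) for _ in range(100_000)]

def block_ranges(vals):
    """Per-block (min, max) summaries, as a minmax index would store."""
    return [(min(vals[i:i + ROWS_PER_BLOCK]), max(vals[i:i + ROWS_PER_BLOCK]))
            for i in range(0, len(vals), ROWS_PER_BLOCK)]

probe = 4_242
random_hits = sum(lo <= probe <= hi for lo, hi in block_ranges(values))
sorted_hits = sum(lo <= probe <= hi for lo, hi in block_ranges(sorted(values)))
print(random_hits, sorted_hits, len(values) // ROWS_PER_BLOCK)
```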
Simon Riggs wrote
> On 5 November 2013 09:57, Leonardo Francalanci <m_lists@> wrote:
>> While I do believe in testing (since "In theory there is no difference
>> between theory and practice. In practice there is"), I would like to know
>> the "
Simon Riggs wrote
> Minmax indexes seem to surprise many people, so broad generalisations
> aren't likely to be useful.
>
> I think the best thing to do is to publish some SQL requests that
> demonstrate in detail what you are trying to achieve and test them
> against minmax indexes. That way we c
Simon Riggs wrote
> Everybody on this thread is advised to look closely at Min Max indexes
> before starting any further work.
>
> MinMax will give us access to many new kinds of plan, plus they are
> about as close to perfectly efficient, by which I mean almost zero
> overhead, with regard to ins
Andres Freund wrote
> On 2013-11-04 11:27:33 -0500, Robert Haas wrote:
>> On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire <klaussfreire@> wrote:
>> > Such a thing would help COPY, so maybe it's worth a look
>>
>> I have little doubt that a deferred insertion buffer of some kind
>> could
Gavin Flower wrote
> How about being able to mark indexes:
> 'MEMORY ONLY' to make them not go to disk
> and
> 'PERSISTENT | TRANSIENT' to mark if they should be recreated on
> machine bootup?
I would love that. But:
1) I'd like to make some tests with a "memory drive", and confirm t
Jeff Janes wrote
> True, but that is also true of indexes created in bulk. It all has to
> reach disk eventually--
> [...]
> If the checkpoint interval is as long as the partitioning period, then
> hopefully the active index buffers get re-dirtied while protected in
> shared_buffers, and only get
Jeff Janes wrote
> You could periodically merge older partitions into larger tables, index
> those aggregated tables, then transactionally disinherit the old
> partitions
> and inherit the new aggregated one. This would keep the value of K down,
> at the expense of re-writing data multiple times (
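In SQL terms, the disinherit/inherit dance could look roughly like this (all table, column, and index names are invented for illustration; this is the pre-declarative-partitioning inheritance style):

```sql
BEGIN;

-- Roll three closed daily partitions into one monthly table.
CREATE TABLE calls_2013_10 AS
    SELECT * FROM calls_2013_10_01
    UNION ALL SELECT * FROM calls_2013_10_02
    UNION ALL SELECT * FROM calls_2013_10_03;

CREATE INDEX calls_2013_10_imsi_idx ON calls_2013_10 (imsi);

-- Swap the aggregated table in atomically.
ALTER TABLE calls_2013_10_01 NO INHERIT calls;
ALTER TABLE calls_2013_10_02 NO INHERIT calls;
ALTER TABLE calls_2013_10_03 NO INHERIT calls;
ALTER TABLE calls_2013_10 INHERIT calls;

COMMIT;
```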
Jeff Janes wrote
> Are partitions read-only once time has moved on, or can stragglers show up
> that need to be inserted into older partitions?
>
> You could periodically merge older partitions into larger tables, index
> those aggregated tables, then transactionally disinherit the old
> partition
> Point being: hardware is marching along pretty fast (after 20+ years
> of stagnation) and it's dangerous (IMO) to make big software
> investments based on the situation on the ground *today*.
Yes, that's a good point.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Jeff Janes wrote
> The index insertions should be fast until the size of the active part of
> the indexes being inserted into exceeds shared_buffers by some amount (what
> that amount is would depend on how much dirty data the kernel is willing to
> allow in the page cache before it starts suff
> LSM-trees seem patent free
I'm no expert, and I only gave it a look some time ago: it looked very
complicated to get right... and as far as I remember you don't get that much
gain unless you go multi-level, which would complicate things further
> Please somebody advise patent status of
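For what it's worth, the core single-level idea fits in a few lines (a Python toy, nothing like a production LSM: no WAL, no bloom filters, no compaction — just buffered writes flushed as immutable sorted runs):

```python
import bisect

class TinyLSM:
    """Minimal one-level LSM sketch: an in-memory buffer flushed as
    immutable sorted runs; lookups scan runs newest-first."""
    def __init__(self, flush_at=4):
        self.buf = {}        # mutable memtable
        self.runs = []       # sorted (key, value) lists, newest last
        self.flush_at = flush_at

    def put(self, key, value):
        self.buf[key] = value
        if len(self.buf) >= self.flush_at:
            self.runs.append(sorted(self.buf.items()))  # sequential write
            self.buf = {}

    def get(self, key):
        if key in self.buf:
            return self.buf[key]
        for run in reversed(self.runs):      # newest run wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = TinyLSM()
for k in range(10):
    db.put(k, k * k)
print(db.get(7), db.get(99))  # → 49 None
```

The multi-level refinement the thread mentions would periodically merge small runs into bigger ones to bound the number of runs a lookup must probe.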
> What is the reason for needing such fast access to individual groups
> of records? Sure sounds like the NSA or similar ;-)
Users need to search all calls originated from/to a user or from/to a specific
mobile phone to answer/analyze customers' probl... ok, I give up: I work for
the NSA ;)
>
> Presumably the data you are inserting isn't actually random. Please
> describe the use case you are considering in more detail and some view
> on how frequent that is, with some examples. Once we understand the
> use case and agree it is important, we might solve problems.
Collecting call data
> Hmm, you realise Alvaro is working on MinMax indexes in this release?
> They are very efficient with regard to index inserts and specially
> designed for use on large tables.
>
> Prior work by Heikki on Grouped Item Tuples was a way of reducing the
> size of indexes, yet still allowing uniqueness
> I bet you've mis-diagnosed the problem. Btrees don't have a problem
> keeping up with 50m records; your problem is that after a certain
> point your page cache can't keep up with the pseudo-random i/o
> patterns and you start seeing faults to storage.
> [...] This has nothing to do with the btree
> They should, in theory, be faster than btrees -- O(1) not O(log N) page
> fetches per lookup. In practice they don't seem to be faster, and
> nobody's bothered to find out exactly why. Again, this isn't a terribly
> encouraging precedent for implementing some other index type that's
> supposed
> Another point to add: I don't really see btree as a barrier to
> performance for most of the problems I face. The real barriers to
> database performance are storage, contention, and query planning.
Ehm, that's true for regular OLTP stuff, which I understand is what most (95%?)
of people use/ne
> Before getting too excited about some new academic index type, it's worth
> noting the sad state in which hash indexes have languished for years.
> Nobody's bothered to add WAL support, let alone do any other real work
> on them. The non-btree index types that have been getting love are the
> on
Hi,
I don't see much interest in insert-efficient indexes. These are the ones I've
found:
- LSM-tree (used by Cassandra and SQLite4?)
- Y-Tree
(http://www.bossconsulting.com/oracle_dba/white_papers/DW%20in%20oracle/P23%20(ytree%20index%20structure%20for%20DWs).pdf)
- Fractal indexes (TokuDB,
Agreed.
I do agree with Heikki that it really ought to be the OS problem, but
then we thought that about dtrace and we're still waiting for that or
similar to be usable on all platforms (+/- 4 years).
My point is that it looks like this is going to take 1-2 years in
postgresql, so it looks li
I think what I meant was: isn't this going to be useless in a couple
of years (if, say, btrfs becomes available)? Or does it actually give
something that the FS will never be able to give?
Yes, it will help you find/address bugs in the filesystem. These things
are not unheard of...
It sounds to me li
On 21/12/2011 16.19, Stephen Frost wrote:
* Leonardo Francalanci (m_li...@yahoo.it) wrote:
I can't help in this discussion, but I have a question:
how different would this feature be from filesystem-level CRC, such
as the one available in ZFS and btrfs?
Depends on how much you trus
I can't help in this discussion, but I have a question:
how different would this feature be from filesystem-level CRC, such as
the one available in ZFS and btrfs?
> New API AnalyzeForeignTable
I didn't look at the patch, but I'm using CSV foreign tables with named pipes
to get near-real-time KPIs calculated by PostgreSQL. Of course, pipes can be
read just once, so I wouldn't want an "automatic analyze" of foreign tables...
> compression is an interesting topic: the guys over at tokudb are
> making some wild claims...i'm curious if they are real, and what the
> real tradeoffs are.
I don't know how much of the performance they claim comes from
compression and how much from the different indexing technique they
use
> I've nearly finished editing prior to commit, so no worries.
Thank you, let me know if I can help.
Leonardo
> Leonardo,
>
> Your patch:
>
> use less space in xl_xact_commit
>
> ... has been waiting on an updated version from you for 10 days now. Do
> you think you're likely to complete it for this CommitFest?
I sent an email on the subject:
http://postgresql.1045698.n5.nabble.com/use-less-space-
> The important thing is that we retain backwards compatibility with
> current XLOG_XACT_COMMIT. I'm not worried what we call the other one.
Ok, let me see if I got it right:
#define XLOG_XACT_COMMIT 0x00
should become:
#define XLOG_XACT_COMMIT_WITH_INFO 0x00
and I'll add a
> With regards to the naming, I think it would be better if we kept
> XLOG_XACT_COMMIT record exactly as it is now, and make the second
> record an entirely new record called XLOG_XACT_COMMIT_FASTPATH. That
> way we retain backwards compatibility.
>
> If you'd like to rework like that please,
> > On Wed, May 25, 2011 at 3:05 PM, Simon Riggs wrote:
> > Leonardo, can you submit an updated version of this patch today that
> > incorporates Simon's suggestion?
Mmmh, maybe it was simpler than I thought; I must be
missing something... patch attached
How can I test it with "weird" s
> On Tue, Jun 14, 2011 at 2:31 PM, Simon Riggs wrote:
> Well, we certainly have the option to review and commit the patch any
> time up until feature freeze. However, I don't want the CommitFest
> application to be full of entries for patches that are not actually
> being worked on, because it
> We don't need to be in a hurry here. As the reviewer I'm happy to give
> Leonardo some time, obviously no more than the end of the commit fest.
>
> If he doesn't respond at all, I'll do it, but I'd like to give him the
> chance and the experience if possible.
Sorry I couldn't update the pat
> Well, I sort of assumed the design was OK, too, but the more we talk
> about this WAL-logging stuff, the less convinced I am that I really
> understand the problem. :-(
I see. In fact, I think nobody thought about restart points...
To sum up:
1) everything seems ok when in the wal_level =
> I think
> we need a detailed design document for how this is all going to work.
> We need to not only handle the master properly but also handle the
> slave properly. Consider, for example, the case where the slave
> begins to replay the transaction, reaches a restartpoint after
> replaying
> Why is it necessary to replay the operation only on the slave? Can we
> just use XLOG_HEAP_NEWPAGE?
Uh, I don't know why, but I thought I shouldn't log a page on the master,
since all the pages are already there and fsynced. But if it does no harm,
I can easily use XLOG_HEAP_NEWPAGE (of co
> From: Noah Misch
> > - the patch is missing the "send all table pages to the
> > standby" part; is there some code I can use as base?
>
> Nothing comes to mind as especially similar.
>
> > I guess I have to generate some special log type that
> > is only "played" by standby servers.
>
> W
Sorry, email sent without body.
Fixed some English mistakes.
commitlog_lessbytes02.patch
Description: Binary data
> From: Simon Riggs
> I can't find a clear discussion of what you are trying to do, and how,
> just a URL back to a complex discussion on another topic.
While trying to write a patch to allow changing an unlogged table into
a logged one, I had to add another int field to xl_xact_commit.
Robert H
> - the patch is missing the "send all table pages to the
> standby" part; is there some code I can use as base?
> I guess I have to generate some special log type that
> is only "played" by standby servers.
Maybe I could use log_newpage, but instead of
XLOG_HEAP_NEWPAGE I could use something
I'll try to sum up what I understood:
1) the standby keeps the lock, so there's no problem with
stray files coming from the unlogged->logged log
replay, as the table can't be read during the operation
2) calling ResetUnloggedRelations before
ProcArrayApplyRecoveryInfo would remove the problem
of the stra
> I'd guess some WAL record arising from the post-crash master restart makes the
> standby do so. When a crash isn't involved, the commit or abort record is that
> signal. You could test and find out how it happens after a master crash with a
> procedure like this:
>
> 1. Start a mas
> On Wed, May 18, 2011 at 04:02:59PM +0100, Leonardo Francalanci wrote:
> > > By the time the startup process
> > > releases the AccessExclusiveLock acquired by the proposed
> > > UNLOGGED -> normal conversion process, that relfilenode
> > > needs t
> By the time the startup process
> releases the AccessExclusiveLock acquired by the proposed
> UNLOGGED -> normal conversion process, that relfilenode
> needs to be either fully copied or unlinked all over again.
> (Alternately, find some other way to make sure queries don't
> read the half-c
this is a second version: now using
int counts[1]; /* variable-length array of counts */
in xl_xact_commit to keep track of the number of
different arrays at the end of the struct.
Waiting for feedback...
Leonardo
commitlog_lessbytes00.patch
Description: Binary data
> int counts[1]; /* variable-length array of counts, xinfo flags define
> length of array and meaning of counts */
Damn, that's much cleaner than what I did. I don't know why
I stuck with the idea that it had to be:
int
array
int
array
...
instead of:
int
int
...
array
array
...
which mak
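The "all counts first, then all arrays" layout is easy to sketch outside the server (Python here purely to illustrate the byte layout; the flag names and field widths are invented, not the real xl_xact_commit):

```python
import struct

# Presence bits in the toy "xinfo" word, one per optional array.
HAS_RELS, HAS_SUBXIDS = 0x01, 0x02

def encode(rels, subxids):
    """Flag word first, then every count, then the arrays back to back."""
    xinfo = (HAS_RELS if rels else 0) | (HAS_SUBXIDS if subxids else 0)
    out = [struct.pack("<I", xinfo)]
    out += [struct.pack("<i", len(a)) for a in (rels, subxids) if a]
    out += [struct.pack(f"<{len(a)}i", *a) for a in (rels, subxids)]
    return b"".join(out)

def decode(buf):
    xinfo = struct.unpack_from("<I", buf)[0]
    off, counts = 4, []
    for bit in (HAS_RELS, HAS_SUBXIDS):
        if xinfo & bit:                      # count present only if flagged
            counts.append(struct.unpack_from("<i", buf, off)[0])
            off += 4
        else:
            counts.append(0)
    arrays = []
    for n in counts:                         # arrays follow, back to back
        arrays.append(list(struct.unpack_from(f"<{n}i", buf, off)))
        off += 4 * n
    return arrays

print(decode(encode([10, 11], [100, 101, 102])))  # → [[10, 11], [100, 101, 102]]
```

With the counts grouped up front, a decoder can locate every array with one running offset, and an absent array costs zero bytes.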
Hi,
following the conversation at
http://postgresql.1045698.n5.nabble.com/switch-UNLOGGED-to-LOGGED-tp4290461p4382333.html
I tried to remove some bytes from xl_xact_commit.
The way I did it needs palloc+memcpy. I guess it could be done
reusing the memory for smgrGetPendingDeletes. But I don
> I don't think making xinfo shorter will save anything, because
> whatever follows it is going to be a 4-byte quantity and therefore
> 4-byte aligned.
Oops, I didn't notice that.
I'll split xinfo into:
uint16 xinfo;
uint16 presentFlags;
I guess that helps with the reading? I mean, instead
> Yes, that seems like a very appealing approach. There is plenty of
> bit-space available in xinfo, and we could reserve a bit each for
> nrels, nsubxacts, and nmsgs, with set meaning that an integer count of
> that item is present and clear meaning that the count is omitted from
> the struct
> On Fri, Apr 22, 2011 at 4:13 AM, Leonardo Francalanci wrote:
> >> Maybe you should change xl_act_commit to have a separate list of rels to
> >> drop the init fork for (instead of mixing those with the list of files
to
> >> drop as a whole).
> >
>
> > If that 1% is random (not time/transaction related), usually you'd rather have an empty table.
>
> Why do you think it would be random?
"Heap blocks would be zeroed if they were found to be damaged, following a
crash."
If you erase full blocks, you have no idea what data you erased; it
> The amount of data loss on a big table will be <1% of the data loss
> caused by truncating the whole table.
If that 1% is random (not time/transaction related), usually you'd rather have
an empty table. In other words: is a table that is not consistent with anything
else in the db useful?
The only data we can't rebuild is the heap. So what about an option for
UNlogged indexes on a LOGged table? It would always preserve data, and it would
'only' cost rebuilding the indexes in case of an unclean shutdown. I think
it would give a boost in performance for all those cases where
> > another question regarding indexes. Sadly I can't find enough info in
> > the documentation. Which of the functions are needed in order for a
> > index to work?
>
> All of them.
Maybe I completely misunderstood the question, but some functions
are optional, such as amgetbitmap, right?
> Maybe you should change xl_act_commit to have a separate list of rels to
> drop the init fork for (instead of mixing those with the list of files to
> drop as a whole).
I tried to follow your suggestion, thank you very much.
Here's a first attempt at the patch.
I "tested" it with:
crea
I think I coded a very basic version of the UNLOGGED to LOGGED patch
(only wal_level=minimal case for the moment).
To remove the INIT fork, I changed PendingRelDelete somewhat, to have
a flag "bool onlyInitFork" so that the delete would remove only the INIT
fork at commit.
Everything "works" (note
> > If the master crashes while a transaction that used CREATE TABLE is unfinished,
> > both the master and the standby will indefinitely retain identical, stray (not
> > referenced by pg_class) files. The catalogs do reference the relfilenode of
> > each unlogged relation; currently, tha
> > But re-reading it, I don't understand: what's the difference in creating
> > a new "regular" table and crashing before emitting the abort record,
> > and converting an unlogged table to logged and crashing before
> > emitting the abort record? How do the standby servers handle a
> > "CREATE
> I'm pretty sure we wouldn't accept a patch for a feature that would
> only work with wal_level=minimal, but it might be a useful starting
> point for someone else to keep hacking on.
I understand.
Reading your post at
http://archives.postgresql.org/pgsql-hackers/2011-01/msg00315.php
I thoug
Hi,
I read the discussion at
http://archives.postgresql.org/pgsql-hackers/2011-01/msg00315.php
From what I can understand, going from/to unlogged to/from logged in
the wal_level == minimal case is not too complicated.
Suppose I try to write a patch that allows
ALTER TABLE tablename SET L
> Here's my single biggest tip for newcomers to the Postgres source:
> if you don't use ctags, glimpse, or some other tool that can quickly
> show you all references to a given identifier, go out and get one.
> It's one of the easiest ways to learn about things.
I guess Eclipse is the best to
> Cases with lots of irrelevant indexes. Zoltan's example had 4 indexes
> per child table, only one of which was relevant to the query. In your
> test case there are no irrelevant indexes, which is why the runtime
> didn't change.
Mmh... I must be doing something wrong. It looks to me it's
> Hmm, maybe I spoke too soon. With 9000 child tables I get a profile
> like this:
Well, the 9000-table-test-case was meant to check the difference in
performance with/without the patch... I don't see the reason for trying
to optimize such an unrealistic case.
BTW can someone explain to me whi
> This is going to be dominated by constraint exclusion checking. There's
> basically no fix for that except a more explicit representation of the
> partitioning rules.
Damn, I knew that was going to be more complicated :)
So in which case does this patch help? I guess in a multi-index
scena
> but I don't get any gain from the patch... explain time is still around 250 ms.
> Tried with 9000 partitions, time is still 2 secs.
> Tried with 9000 partitions, time is still 2 secs.
Small correction: I tried with 3000 partitions (FOR i IN 0..3000 ...)
and got 250ms with both versions, with 9000 partitions 2 secs (again
no gain from the pat
> On the other hand, if I use a similar test case to my original one
> (i.e. the tables are much wider) then the query planning takes
> 1.42 seconds in 9.1 with this patch instead of about 4.7 seconds
> as we observed it using PostgreSQL 9.0.0. The beginning of the gprof
> output now looks like
> Before you start coding, have you looked over what's in
> contrib/intarray ?
There's nothing that fulfills my needs there, but I guess it
would be the perfect place to watch for code examples!
Thank you
(I think I made my custom aggregation function work,
but I'll look into intarray code to
Hi,
I want to write a custom agg function that, given an "int4 index",
increments the element at "index" of an array and, at the end, returns
the array. The array will always be int4[].
I need it in C, since plpgsql is way slower (and I need to use it in
5M+ rows). I did it, but I also need to c
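For reference, the SQL-level shape of such an aggregate might be as below (all names are invented; the point of the thread is that the transition function should be C rather than plpgsql for speed, so the plpgsql body is just a stand-in for what the C function must do):

```sql
-- Transition function: extend the int4[] state as needed and
-- bump the element at idx (new positions are NULL-filled).
CREATE FUNCTION incr_at(state int4[], idx int4) RETURNS int4[] AS $$
BEGIN
    state[idx] := coalesce(state[idx], 0) + 1;
    RETURN state;
END;
$$ LANGUAGE plpgsql;

CREATE AGGREGATE count_at(int4) (
    SFUNC = incr_at,
    STYPE = int4[],
    INITCOND = '{}'
);

-- Usage: SELECT count_at(idx_column) FROM big_table;
```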
> I wonder if we could improve this with some sort of process-local
> cache - not to get rid of hint bits, just to reduce pg_clog
> contention. We might easily end up testing the same XID many times
> during the same table scan.
I guess that's my scenario... not that many transactions, so ev
> Reduction of contention for pg_clog access, for one thing. If you read
> the archives, you'll find that pg_clog access contention has been shown
> to be one cause of "context swap storms". Having to go to clog for
> every single tuple access would make that orders of magnitude worse.
Ok;
Hi,
I was wondering what the advantage of having hint bits is for OLAP-style
workloads, that is, when the number of transactions is not that high.
If I got it right, in 10 pg_clog pages we can store the status of more
than 320,000 transactions. That's a lot, in a very small space
(80KB?).
So I
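The arithmetic behind that estimate (assuming, as in PostgreSQL, 2 status bits per transaction and 8 kB clog pages):

```python
BITS_PER_XACT = 2      # committed / aborted / in progress / sub-committed
PAGE_SIZE = 8192       # bytes per pg_clog page
PAGES = 10

xacts_per_page = PAGE_SIZE * 8 // BITS_PER_XACT
print(xacts_per_page, PAGES * xacts_per_page, PAGES * PAGE_SIZE // 1024)
# → 32768 transactions per page, 327680 in 10 pages, i.e. 80 kB
```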
> Applied with some significant editorialization. The biggest problem I
> found was that the code for expression indexes didn't really work, and
> would leak memory like there's no tomorrow even when it did work.
Sorry I couldn't write it the way it was supposed to be... I'll look at the
difference
> It sounds like the costing model might need a bit more work before we commit this.
I tried again the simple sql tests I posted a while ago, and I still get the
same ratios.
I've tested the applied patch on a dual opteron + disk array Solaris machine.
I really don't get how a laptop hard dr
> I ran a few more performance tests on this patch. Here's what I got
> for the tests Leonardo posted originally:
> * 2M rows: 22 seconds for seq. scan, 24 seconds for index scan
> * 5M rows: 139 seconds for seq. scan, 97 seconds for index scan
> * 10M rows: 256 seconds seq. scan, 61
> > Here's my post with a (very simple) performance test:
> > http://archives.postgresql.org/pgsql-hackers/2010-02/msg00766.php
> I think the 10M rows test is more in line with what we want (83s vs. 646).
Can someone else test the patch to see if what I found is still valid?
I don't think it ma
> > 10% is nothing. I was expecting this patch would give an order of
> > magnitude of improvement or somethine like that in the worst cases of
> > the current code (highly unsorted input)
>
> Yes. It should be x10 faster than ordinary method in the worst cases.
Here's my post with a (very
> I think writetup_rawheap() and readtup_rawheap() are a little complex,
> but should work as long as there are no padding between t_len and t_self
> in HeapTupleData struct.
>
> - It might be cleaner if you write the total item length
> and tuple data separately.
> - "(char *) tuple + sizeo