Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On Thu, Jul 2, 2015 at 7:33 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Simon Riggs si...@2ndquadrant.com writes:
>> On 2 July 2015 at 14:08, Heikki Linnakangas hlinn...@iki.fi wrote:
>>> I'm marking this as returned with feedback in the commitfest.
>>
>> There are no unresolved issues with the approach, nor is it true it is slower. If you think there are some, you should say what they are, not act high-handedly to reject a patch without reason.
>
> Have you read the thread?

I have read the thread again with a focus on the various problems (objections) raised for this patch, and here is a summary of my reading.

1. lseek can lie about EOF, which could lead to a serious problem during checkpoint if DROP TABLE misses removing a shared buffer belonging to the table being dropped. This problem can be solved by maintaining a list of dropped relations and then, during the checkpoint's buffer-pool scan, cleaning such buffers. Something similar is already used to avoid comparable problems during fsync.

2. The patch doesn't cater to DROP TABLESPACE and DROP DATABASE operations. It would have been better if there were a simpler solution for these operations as well, but even if we come up with something generic enough to avoid problems for these operations, I think there is no reason why it couldn't easily be adopted for the DROP TABLE operation, as the changes proposed by this patch are localized to one function (which we would have to change anyway, even without this patch, if we come up with a generic mechanism). The other changes required to avoid problem 1 (the lseek problem) would still be needed even when a patch for the generic approach is ready. As mentioned by Andrew, another thing to note is that these operations are much less used than DROP/TRUNCATE TABLE, so I think optimizing them is of somewhat lower priority.

3. We can't close-and-reopen a file (to avoid lseek lying about EOF, or otherwise), as that might lead to a failure if there is a flush operation on the file in parallel. I haven't checked this, but I think we can find some way to check whether the vm and fsm files exist before checking the number of blocks for those files.

Apart from the above, Heikki mentioned an overflow in the total-number-of-blocks calculation, which I think is a relatively simple problem to fix.

> There were plenty of objections, as well as a design for a better solution.

I think by "better solution" you mean the radix-tree-based approach or something similar. First, I don't see any clear design for it, and second, even if we have a patch for it ready tomorrow, it would not be very difficult to replace the proposed solution even after it is committed, as the changes are localized to one function.

> In addition, we should think about more holistic solutions such as Andres' nearby proposal (which I doubt will work as-is, but maybe somebody will think of how to fix it). Committing a band-aid will just make it harder to pursue real fixes.

Right, but OTOH waiting a long time for something much more generic doesn't sound wise either, especially when we can replace this with the generic solution without much difficulty. Having said that, I am not wedded to working on this idea, so if you and/or others have no inclination for work in this direction, then I will stop, and let's wait for the day when we have a clear idea for a generic approach.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
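The deferred-cleanup idea in point 1 could be sketched roughly as below. This is a toy illustration, not PostgreSQL's actual mechanism: the function names, the fixed-size array, and the one-field tag standing in for a full BufferTag are all hypothetical simplifications; the real scheme would queue requests the way RememberFsyncRequest() handles UNLINK_RELATION_REQUEST.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical one-field relation key; a real BufferTag also carries
 * tablespace, database, fork number, and block number. */
typedef struct { unsigned rel; } RelKey;

#define MAX_PENDING_DROPS 64
static RelKey pending_drops[MAX_PENDING_DROPS];
static int n_pending_drops = 0;

/* Remember a dropped relation, analogous to queuing an
 * UNLINK_RELATION_REQUEST via RememberFsyncRequest(). */
static void
remember_dropped_rel(RelKey k)
{
    if (n_pending_drops < MAX_PENDING_DROPS)
        pending_drops[n_pending_drops++] = k;
}

/* During the checkpoint's buffer-pool scan, a buffer whose tag matches
 * a dropped relation is invalidated instead of being written out to a
 * no-longer-extant file. */
static bool
buffer_belongs_to_dropped_rel(RelKey tag)
{
    for (int i = 0; i < n_pending_drops; i++)
        if (pending_drops[i].rel == tag.rel)
            return true;
    return false;
}
```

The point of the sketch is that the expensive part (the scan) happens during work the checkpointer already does, so DROP TABLE itself only appends to a small list.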
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On 07/02/2015 04:33 PM, Simon Riggs wrote:
> On 2 July 2015 at 14:08, Heikki Linnakangas hlinn...@iki.fi wrote:
>> On 06/27/2015 07:45 AM, Amit Kapila wrote:
>>> Sometime back on one of the PostgreSQL blogs [1], there was discussion about the performance of drop/truncate table for large values of shared_buffers, and it seems that as the value of shared_buffers increases, the performance of drop/truncate table becomes worse. I think those are not often-used operations, so it never became a priority to look into improving them if possible. I have looked into it and found that the main reason for such behaviour is that these operations traverse the whole of shared_buffers, and it seems to me that we don't need that, especially for not-so-big tables. We can optimize that path by looking into the buffer mapping table for the pages that exist in shared_buffers, for the case when the table size is less than some threshold (say 25%) of shared buffers. The attached patch implements the above idea, and I found that performance doesn't dip much with the patch even with a large value of shared_buffers. I have also attached the script and sql file used to take the performance data.
>>
>> I'm marking this as returned with feedback in the commitfest. In addition to the issues raised so far, ISTM that the patch makes dropping a very large table with small shared_buffers slower (DropForkSpecificBuffers() is O(n) where n is the size of the relation, while the current method is O(shared_buffers)).
>
> There are no unresolved issues with the approach, nor is it true it is slower. If you think there are some, you should say what they are, not act high-handedly to reject a patch without reason.

Oh, I missed the NBuffers / 4 threshold. (The total_blocks calculation is prone to overflowing, btw, if you have a table close to 32 TB in size. But that's trivial to fix.)

>> I concur that we should explore using a radix tree or something else that would naturally allow you to find all buffers for relation/database X quickly.
>
> I doubt that it can be managed efficiently while retaining optimal memory management. If it can, it's going to be very low priority (or should be). This approach works, so let's do it, now. If someone comes up with a better way later, great.

*shrug*. I don't think this is ready to be committed. I can't stop you from working on this, but as far as this commitfest is concerned, case closed.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
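The overflow Heikki mentions is of the kind sketched below: a relation can have up to 2^32 - 1 blocks (just under 32 TB at 8 KB per block), so summing per-fork block counts into a 32-bit BlockNumber can wrap. The function name and the flat array of fork sizes are illustrative, not the patch's actual code; the fix is simply to accumulate in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;   /* PostgreSQL block numbers are 32-bit */

/* Sum per-fork block counts into a 64-bit total so a relation near the
 * 32 TB limit (close to 2^32 blocks) cannot overflow the accumulator. */
static uint64_t
total_blocks(const BlockNumber *fork_blocks, int nforks)
{
    uint64_t total = 0;
    for (int i = 0; i < nforks; i++)
        total += fork_blocks[i];    /* each operand widened to 64 bits */
    return total;
}
```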
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On 07/02/2015 04:18 PM, Amit Kapila wrote:
> On Thu, Jul 2, 2015 at 6:38 PM, Heikki Linnakangas hlinn...@iki.fi wrote:
>> On 06/27/2015 07:45 AM, Amit Kapila wrote:
>>> Sometime back on one of the PostgreSQL blogs [1], there was discussion about the performance of drop/truncate table for large values of shared_buffers [...]
>>
>> I'm marking this as returned with feedback in the commitfest. In addition to the issues raised so far, [...]
>
> Don't you think the approach discussed (delayed cleanup of buffers during the checkpoint scan) is sufficient to fix the issues raised?

Dunno, but there is no patch for that.

- Heikki
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On 2015-07-02 16:08:03 +0300, Heikki Linnakangas wrote:
> I'm marking this as returned with feedback in the commitfest. In addition to the issues raised so far, ISTM that the patch makes dropping a very large table with small shared_buffers slower (DropForkSpecificBuffers() is O(n) where n is the size of the relation, while the current method is O(shared_buffers)).

I think upthread there was talk about only using the O(relsize) approach if relsize < NBuffers. That's actually a pretty common scenario, especially in test suites.

> I concur that we should explore using a radix tree or something else that would naturally allow you to find all buffers for relation/database X quickly.

I unsurprisingly think that's the right way to go. But I'm not sure whether it wouldn't be worth adding another layer of band-aid till we're there...

Greetings,

Andres Freund
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On Thu, Jul 2, 2015 at 6:38 PM, Heikki Linnakangas hlinn...@iki.fi wrote:
> On 06/27/2015 07:45 AM, Amit Kapila wrote:
>> Sometime back on one of the PostgreSQL blogs [1], there was discussion about the performance of drop/truncate table for large values of shared_buffers [...]
>
> I'm marking this as returned with feedback in the commitfest. In addition to the issues raised so far,

Don't you think the approach discussed (delayed cleanup of buffers during the checkpoint scan) is sufficient to fix the issues raised?

> ISTM that the patch makes dropping a very large table with small shared_buffers slower (DropForkSpecificBuffers() is O(n) where n is the size of the relation, while the current method is O(shared_buffers)).

Can you please explain how you reached that conclusion? The patch takes that path only when the relation size is less than 25% of shared buffers.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
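The cost argument being made here boils down to a threshold test, which can be sketched as follows. The names and the exact 25% figure are taken from the discussion, not from the patch's actual code: probe the buffer mapping table block by block (cost proportional to relation size) only when the relation is small relative to shared_buffers, otherwise keep the existing full buffer-pool scan (cost proportional to NBuffers).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decision function: use O(relsize) per-block lookups in
 * the buffer mapping table only when the relation occupies less than
 * NBuffers / 4 (25% of shared buffers); otherwise fall back to the
 * O(NBuffers) scan, which is then the cheaper of the two. */
static bool
use_per_block_lookup(uint64_t rel_blocks, uint64_t nbuffers)
{
    return rel_blocks < nbuffers / 4;
}
```

Under this rule, dropping a very large table never pays more than the current O(shared_buffers) cost, which is the basis of Amit's objection to the "slower" claim.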
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
Simon Riggs si...@2ndquadrant.com writes:
> On 2 July 2015 at 14:08, Heikki Linnakangas hlinn...@iki.fi wrote:
>> I'm marking this as returned with feedback in the commitfest.
>
> There are no unresolved issues with the approach, nor is it true it is slower. If you think there are some, you should say what they are, not act high handedly to reject a patch without reason.

Have you read the thread? There were plenty of objections, as well as a design for a better solution. In addition, we should think about more holistic solutions such as Andres' nearby proposal (which I doubt will work as-is, but maybe somebody will think of how to fix it). Committing a band-aid will just make it harder to pursue real fixes.

regards, tom lane
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On 2 July 2015 at 14:08, Heikki Linnakangas hlinn...@iki.fi wrote:
> On 06/27/2015 07:45 AM, Amit Kapila wrote:
>> Sometime back on one of the PostgreSQL blogs [1], there was discussion about the performance of drop/truncate table for large values of shared_buffers [...]
>
> I'm marking this as returned with feedback in the commitfest. In addition to the issues raised so far, ISTM that the patch makes dropping a very large table with small shared_buffers slower (DropForkSpecificBuffers() is O(n) where n is the size of the relation, while the current method is O(shared_buffers)).

There are no unresolved issues with the approach, nor is it true it is slower. If you think there are some, you should say what they are, not act high-handedly to reject a patch without reason.

> I concur that we should explore using a radix tree or something else that would naturally allow you to find all buffers for relation/database X quickly.

I doubt that it can be managed efficiently while retaining optimal memory management. If it can, it's going to be very low priority (or should be). This approach works, so let's do it, now. If someone comes up with a better way later, great.

--
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On Thu, Jul 2, 2015 at 7:27 PM, Heikki Linnakangas hlinn...@iki.fi wrote:
> On 07/02/2015 04:18 PM, Amit Kapila wrote:
>> Don't you think the approach discussed (delayed cleanup of buffers during the checkpoint scan) is sufficient to fix the issues raised?
>
> Dunno, but there is no patch for that.

That's fair enough; however, your reply sounded to me like there is no further need to explore this idea unless we have something like the radix-tree-based approach.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On 06/27/2015 07:45 AM, Amit Kapila wrote:
> Sometime back on one of the PostgreSQL blogs [1], there was discussion about the performance of drop/truncate table for large values of shared_buffers, and it seems that as the value of shared_buffers increases, the performance of drop/truncate table becomes worse. I think those are not often-used operations, so it never became a priority to look into improving them if possible. I have looked into it and found that the main reason for such behaviour is that these operations traverse the whole of shared_buffers, and it seems to me that we don't need that, especially for not-so-big tables. We can optimize that path by looking into the buffer mapping table for the pages that exist in shared_buffers, for the case when the table size is less than some threshold (say 25%) of shared buffers. The attached patch implements the above idea, and I found that performance doesn't dip much with the patch even with a large value of shared_buffers. I have also attached the script and sql file used to take the performance data.

I'm marking this as returned with feedback in the commitfest. In addition to the issues raised so far, ISTM that the patch makes dropping a very large table with small shared_buffers slower (DropForkSpecificBuffers() is O(n) where n is the size of the relation, while the current method is O(shared_buffers)).

I concur that we should explore using a radix tree or something else that would naturally allow you to find all buffers for relation/database X quickly.

- Heikki
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On 1 July 2015 at 15:39, Amit Kapila amit.kapil...@gmail.com wrote:
> Okay. I think we can maintain the list in a similar way as we do for UNLINK_RELATION_REQUEST in RememberFsyncRequest(), but why wait till 64 tables?

I meant once per checkpoint cycle OR every N tables, whichever is sooner. N would need to vary according to the size of NBuffers.

> We already scan the whole buffer list in each checkpoint cycle, so during that scan we can refer to this dropped-relation list and avoid syncing such buffer contents. Also, for ENOENT error handling in FileWrite, we can use this list to identify relations for which we need to ignore the error. I think we are already doing something similar in mdsync to avoid the problem of dropped tables, so it seems okay to have it in mdwrite as well.
>
> The crucial thing to think about in this idea is avoiding reassignment of a relfilenode (due to wrapped OIDs) before we have ensured that none of the buffers contains a tag for that relfilenode. Currently we avoid this for the fsync case by retaining the first segment of the relation (which prevents reassignment of the relfilenode) till the checkpoint ends. I think if we just postpone unlinking until we have validated it in shared_buffers, then we can avoid this problem in the new scheme, and it should be a delay of at most one checkpoint cycle for unlinking such a file, assuming we refer to the dropped-relation list in each checkpoint cycle during the buffer scan.

Yes. So you are keeping more data around for longer, right? I think we would need some way to trigger a scan when the amount of deferred dropped data files hits a certain size.

--
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On Tue, Jun 30, 2015 at 12:10 PM, Simon Riggs si...@2ndquadrant.com wrote:
> On 30 June 2015 at 07:34, Amit Kapila amit.kapil...@gmail.com wrote:
>> On Tue, Jun 30, 2015 at 11:00 AM, Simon Riggs si...@2ndquadrant.com wrote:
>>> On 30 June 2015 at 05:02, Amit Kapila amit.kapil...@gmail.com wrote:
>>>> On Mon, Jun 29, 2015 at 7:18 PM, Simon Riggs si...@2ndquadrant.com wrote:
>>>>> On 28 June 2015 at 17:17, Tom Lane t...@sss.pgh.pa.us wrote:
>>>>> If lseek fails badly then SeqScans would give *silent* data loss, which in my view is worse. Just-added pages aren't the only thing we might miss if lseek is badly wrong.
>>>>
>>>> So for the purpose of this patch, do we need to assume that lseek can give us the wrong size of a file, and should we add preventive checks and other handling for that? I am okay with changing it that way if we are going to have that as an assumption in our code wherever we use it (or will use it in future); otherwise we will end up with some preventive checks that are not actually required.
>>>
>>> They're preventative checks. You always hope it is wasted effort.
>>
>> I am not sure that preventative checks (without a real need) are okay if they are not cheap, which could happen in this case. I think validating a buffer tag would require a rel or syscache lookup.
>
> True, so don't do that. Keep a list of dropped relations and have the checkpoint process scan the buffer pool every 64 tables, kinda like AbsorbFsync.

Okay. I think we can maintain the list in a similar way as we do for UNLINK_RELATION_REQUEST in RememberFsyncRequest(), but why wait till 64 tables? We already scan the whole buffer list in each checkpoint cycle, so during that scan we can refer to this dropped-relation list and avoid syncing such buffer contents. Also, for ENOENT error handling in FileWrite, we can use this list to identify relations for which we need to ignore the error. I think we are already doing something similar in mdsync to avoid the problem of dropped tables, so it seems okay to have it in mdwrite as well.

The crucial thing to think about in this idea is avoiding reassignment of a relfilenode (due to wrapped OIDs) before we have ensured that none of the buffers contains a tag for that relfilenode. Currently we avoid this for the fsync case by retaining the first segment of the relation (which prevents reassignment of the relfilenode) till the checkpoint ends. I think if we just postpone unlinking until we have validated it in shared_buffers, then we can avoid this problem in the new scheme, and it should be a delay of at most one checkpoint cycle for unlinking such a file, assuming we refer to the dropped-relation list in each checkpoint cycle during the buffer scan.

Does that make sense?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On Wed, Jul 1, 2015 at 8:26 PM, Simon Riggs si...@2ndquadrant.com wrote:
> On 1 July 2015 at 15:39, Amit Kapila amit.kapil...@gmail.com wrote:
>> Okay. I think we can maintain the list in a similar way as we do for UNLINK_RELATION_REQUEST in RememberFsyncRequest(), but why wait till 64 tables?
>
> I meant once per checkpoint cycle OR every N tables, whichever is sooner. N would need to vary according to the size of NBuffers.

That sounds sensible to me; see my reply below for what I think we can do to make this work.

>> We already scan the whole buffer list in each checkpoint cycle, so during that scan we can refer to this dropped-relation list and avoid syncing such buffer contents. [...] It should be a delay of at most one checkpoint cycle for unlinking such a file, assuming we refer to the dropped-relation list in each checkpoint cycle during the buffer scan.
>
> Yes. So you are keeping more data around for longer, right?

Yes, and we already do it for the sake of fsyncs.

> I think we would need some way to trigger a scan when the amount of deferred dropped data files hits a certain size.

Okay. I think we can use the number of dropped relations reaching 64, or 0.01% (or 0.1%) of shared_buffers, whichever is smaller, as the point that triggers a checkpoint.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
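The trigger Amit proposes can be expressed as a one-line threshold computation; the function name and the choice of 0.01% are illustrative values from the discussion, not settled design.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical trigger for a deferred-drop buffer scan: fire once the
 * number of dropped-but-not-yet-cleaned relations reaches 64, or 0.01%
 * of shared_buffers, whichever is smaller. */
static uint64_t
deferred_drop_trigger(uint64_t nbuffers)
{
    uint64_t pct = nbuffers / 10000;    /* 0.01% of shared_buffers */
    return pct < 64 ? pct : 64;
}
```

With a large buffer pool the fixed cap of 64 dominates; with a small one the percentage keeps the deferred list proportionally small.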
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On Tue, Jun 30, 2015 at 11:00 AM, Simon Riggs si...@2ndquadrant.com wrote:
> On 30 June 2015 at 05:02, Amit Kapila amit.kapil...@gmail.com wrote:
>> On Mon, Jun 29, 2015 at 7:18 PM, Simon Riggs si...@2ndquadrant.com wrote:
>>> On 28 June 2015 at 17:17, Tom Lane t...@sss.pgh.pa.us wrote:
>>> If lseek fails badly then SeqScans would give *silent* data loss, which in my view is worse. Just-added pages aren't the only thing we might miss if lseek is badly wrong.
>>
>> So for the purpose of this patch, do we need to assume that lseek can give us the wrong size of a file, and should we add preventive checks and other handling for that? I am okay with changing it that way if we are going to have that as an assumption in our code wherever we use it (or will use it in future); otherwise we will end up with some preventive checks that are not actually required.
>
> They're preventative checks. You always hope it is wasted effort.

I am not sure that preventative checks (without a real need) are okay if they are not cheap, which could happen in this case. I think validating a buffer tag would require a rel or syscache lookup.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On 30 June 2015 at 07:34, Amit Kapila amit.kapil...@gmail.com wrote:
> On Tue, Jun 30, 2015 at 11:00 AM, Simon Riggs si...@2ndquadrant.com wrote:
>> They're preventative checks. You always hope it is wasted effort.
>
> I am not sure that preventative checks (without a real need) are okay if they are not cheap, which could happen in this case. I think validating a buffer tag would require a rel or syscache lookup.

True, so don't do that. Keep a list of dropped relations and have the checkpoint process scan the buffer pool every 64 tables, kinda like AbsorbFsync. All the heavy lifting gets done in a background process, and we know we're safe.

--
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On 28 June 2015 at 17:17, Tom Lane t...@sss.pgh.pa.us wrote:
> Simon Riggs si...@2ndquadrant.com writes:
>> On 27 June 2015 at 15:10, Tom Lane t...@sss.pgh.pa.us wrote:
>>> I don't like this too much because it will fail badly if the caller is wrong about the maximum possible page number for the table, which seems not exactly far-fetched. (For instance, remember those kernel bugs we've seen that cause lseek to lie about the EOF position?)
>>
>> If that is true, then our reliance on lseek elsewhere could also cause data loss, for example by failing to scan data during a seq scan.
>
> The lseek point was a for-example, not the entire universe of possible problem sources for this patch. (Also, underestimating the EOF point in a seqscan is normally not an issue, since any rows in a just-added page are by definition not visible to the scan's snapshot. But I digress.)
>
>> The consequences of failure of lseek in this case are nowhere near as dire, since by definition the data is being destroyed by the user.
>
> I'm not sure what you consider dire, but missing a dirty buffer belonging to the to-be-destroyed table would result in the system being permanently unable to checkpoint, because attempts to write out the buffer to the no-longer-extant file would fail. You could only get out of the situation via a forced database crash (immediate shutdown), followed by replaying all the WAL since the time of the problem. In production contexts that could be pretty dire.

Yes, it's bad, but we do notice that it has happened. We can also put in code to specifically avoid this error at checkpoint time. If lseek fails badly then SeqScans would give *silent* data loss, which in my view is worse. Just-added pages aren't the only thing we might miss if lseek is badly wrong.

So, I think this patch still has legs. We can check that the clean-up has been 100% complete when we do the buffer scan at the start of the checkpoint; that way we do just one scan of the buffer pool and move a time-consuming operation into a background process.

--
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On Mon, Jun 29, 2015 at 7:18 PM, Simon Riggs si...@2ndquadrant.com wrote:
> On 28 June 2015 at 17:17, Tom Lane t...@sss.pgh.pa.us wrote:
>> I'm not sure what you consider dire, but missing a dirty buffer belonging to the to-be-destroyed table would result in the system being permanently unable to checkpoint, because attempts to write out the buffer to the no-longer-extant file would fail. You could only get out of the situation via a forced database crash (immediate shutdown), followed by replaying all the WAL since the time of the problem. In production contexts that could be pretty dire.
>
> Yes, it's bad, but we do notice that it has happened. We can also put in code to specifically avoid this error at checkpoint time. If lseek fails badly then SeqScans would give *silent* data loss, which in my view is worse. Just-added pages aren't the only thing we might miss if lseek is badly wrong.

So for the purpose of this patch, do we need to assume that lseek can give us the wrong size of a file, and should we add preventive checks and other handling for that? I am okay with changing it that way if we are going to have that as an assumption in our code wherever we use it (or will use it in future); otherwise we will end up with some preventive checks that are not actually required. Another idea here is to use some way other than lseek to determine the size of the file. Do you think we can use stat() for this purpose? We are already using it in fd.c.

> So, I think this patch still has legs. We can check that the clean-up has been 100% complete when we do the buffer scan at the start of the checkpoint

One way to ensure that is to verify that each buffer header tag is valid (which essentially means checking whether the object exists). Do you have something else in mind to accomplish this part if we decide to go this route?

> - that way we do just one scan of the buffer pool and move a time-consuming operation into a background process.

Agreed, and if that turns out to be cheap, then we might want to use some way to optimize DROP DATABASE and others in the same way.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
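The two file-size mechanisms under discussion, lseek(SEEK_END) and stat()/fstat(), can be compared side by side. This is a simplified illustration (minimal error handling, hypothetical function names), not PostgreSQL's mdnblocks(); the open question in the thread is whether the kernel bugs that make lseek misreport EOF would also affect the stat-based path.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define BLCKSZ 8192   /* PostgreSQL's default block size */

/* Size in blocks via lseek to end-of-file, as md.c effectively does. */
static long
nblocks_via_lseek(int fd)
{
    off_t end = lseek(fd, 0, SEEK_END);
    return end < 0 ? -1 : (long) (end / BLCKSZ);
}

/* Same quantity via fstat's st_size, the alternative Amit raises. */
static long
nblocks_via_fstat(int fd)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;
    return (long) (st.st_size / BLCKSZ);
}
```

On a healthy kernel both return the same answer; the debate is only about which one to trust when the kernel is buggy.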
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On Mon, Jun 29, 2015 at 5:41 AM, Amit Kapila amit.kapil...@gmail.com wrote:
> On Sun, Jun 28, 2015 at 9:05 PM, Tom Lane t...@sss.pgh.pa.us wrote:
>> Amit Kapila amit.kapil...@gmail.com writes:
>>> On Sat, Jun 27, 2015 at 7:40 PM, Tom Lane t...@sss.pgh.pa.us wrote:
>>>> I don't like this too much because it will fail badly if the caller is wrong about the maximum possible page number for the table, which seems not exactly far-fetched. (For instance, remember those kernel bugs we've seen that cause lseek to lie about the EOF position?)
>>>
>>> Considering we already have an exclusive lock while doing this operation, and nobody else can perform a write on this file, won't closing and opening it again avoid such problems?
>>
>> On what grounds do you base that touching faith?
>
> Sorry, but I don't get what problem you see in this.

On thinking about it again, I think you are worried that if we close and reopen the file, it would break any flush operation that could happen in parallel via a checkpoint. Still, I am not clear whether we want to assume that we can't rely on lseek for the size of a file if there can be parallel write activity on the file (even if that write doesn't increase the size of the file). If yes, then we have the options below:

a. Have some protection mechanism for file access (ignore the error for a file not present or accessed during flush) and clean the buffers containing invalid objects, as is being discussed upthread.
b. Use some other API, like stat(), to get the size of the file.

Do you prefer either of these, or if you have a better idea, please do share it.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On 30 June 2015 at 05:02, Amit Kapila amit.kapil...@gmail.com wrote:
> On Mon, Jun 29, 2015 at 7:18 PM, Simon Riggs si...@2ndquadrant.com wrote:
>> Yes, it's bad, but we do notice that it has happened. We can also put in code to specifically avoid this error at checkpoint time. If lseek fails badly then SeqScans would give *silent* data loss, which in my view is worse. Just-added pages aren't the only thing we might miss if lseek is badly wrong.
>
> So for the purpose of this patch, do we need to assume that lseek can give us the wrong size of a file, and should we add preventive checks and other handling for that? I am okay with changing it that way if we are going to have that as an assumption in our code wherever we use it (or will use it in future); otherwise we will end up with some preventive checks that are not actually required.

They're preventative checks. You always hope it is wasted effort.

--
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On Sat, Jun 27, 2015 at 11:38 AM, Andres Freund and...@anarazel.de wrote: On 2015-06-27 10:10:04 -0400, Tom Lane wrote: In the past we've speculated about fixing the performance of these things by complicating the buffer lookup mechanism enough so that it could do "find any page for this table/tablespace/database" efficiently. Nobody's had ideas that seemed sane performance-wise though. I've started to play around with doing that a year or three back. My approach was to use a linux style radix tree for the buffer mapping table. Besides lack of time what made it hard to be efficient was the size of our buffer tags requiring rather deep trees. I think I was considering playing around with a two-level tree (so we could cache a pointer in Relation or such), but the memory management requirements for that made my head hurt too much. The other alternative is to work on having a much simpler buffer tag. Wouldn't even a two-level tree have the same problem you complained about vis-a-vis chash? In that case, you were of the opinion that even ONE extra level of indirection was enough to pinch. (I'm still hoping there is a way to fix that, but even so.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
Simon Riggs si...@2ndquadrant.com writes: On 27 June 2015 at 15:10, Tom Lane t...@sss.pgh.pa.us wrote: I don't like this too much because it will fail badly if the caller is wrong about the maximum possible page number for the table, which seems not exactly far-fetched. (For instance, remember those kernel bugs we've seen that cause lseek to lie about the EOF position?) If that is true, then our reliance on lseek elsewhere could also cause data loss, for example by failing to scan data during a seq scan. The lseek point was a for-example, not the entire universe of possible problem sources for this patch. (Also, underestimating the EOF point in a seqscan is normally not an issue since any rows in a just-added page are by definition not visible to the scan's snapshot. But I digress.) The consequences of failure of lseek in this case are nowhere near as dire, since by definition the data is being destroyed by the user. I'm not sure what you consider dire, but missing a dirty buffer belonging to the to-be-destroyed table would result in the system being permanently unable to checkpoint, because attempts to write out the buffer to the no-longer-extant file would fail. You could only get out of the situation via a forced database crash (immediate shutdown), followed by replaying all the WAL since the time of the problem. In production contexts that could be pretty dire. regards, tom lane
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On 06/27/2015 10:10 AM, Tom Lane wrote: It also offers no hope of a fix for the other operations that scan the whole buffer pool, such as DROP TABLESPACE and DROP DATABASE. Improving DROP TABLE / TRUNCATE would still be a significant advance. These cases cause far more real world pain than the ones you mention, in my experience. cheers andrew
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
Amit Kapila amit.kapil...@gmail.com writes: On Sat, Jun 27, 2015 at 7:40 PM, Tom Lane t...@sss.pgh.pa.us wrote: I don't like this too much because it will fail badly if the caller is wrong about the maximum possible page number for the table, which seems not exactly far-fetched. (For instance, remember those kernel bugs we've seen that cause lseek to lie about the EOF position?) Considering we already have an exclusive lock while doing this operation and nobody else can write to this file, wouldn't closing and opening it again avoid such problems? On what grounds do you base that touching faith? Quite aside from outright bugs, having lock on a table has nothing to do with whether low-level processes such as the checkpointer can touch it. regards, tom lane
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On 2015-06-28 09:11:29 -0400, Robert Haas wrote: On Sat, Jun 27, 2015 at 11:38 AM, Andres Freund and...@anarazel.de wrote: I've started to play around with doing that a year or three back. My approach was to use a linux style radix tree for the buffer mapping table. Besides lack of time, what made it hard to be efficient was the size of our buffer tags requiring rather deep trees. I think I was considering playing around with a two-level tree (so we could cache a pointer in Relation or such), but the memory management requirements for that made my head hurt too much. The other alternative is to work on having a much simpler buffer tag.

Wouldn't even a two-level tree have the same problem you complained about vis-a-vis chash?

I was hoping to avoid the upper tree (mapping relfilenodes to the block tree) in the majority of cases by caching that mapping in struct Relation or so. But generally, yes, a tree will have more indirections. But they're often of a different quality than with a hash table. There's a high amount of spatial locality when looking up blocks: you're much more likely to look up a block close to one recently looked up than a randomly different one. Hash tables don't have a way to benefit from that; tree structures sometimes do.

I don't think using a radix tree in itself will have significant performance benefits over the hash table. But it allows for a bunch of cool further optimizations like the aforementioned 'intelligent' readahead, combining writes when flushing buffers, and - which made me look into it originally - tagging inner nodes in the tree with information about dirtiness to avoid scanning large amounts of non-dirty buffers.

In that case, you were of the opinion that even ONE extra level of indirection was enough to pinch. (I'm still hoping there is a way to fix that, but even so.)

Me too.
Andres
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On 27 June 2015 at 15:10, Tom Lane t...@sss.pgh.pa.us wrote: Amit Kapila amit.kapil...@gmail.com writes: I have looked into it and found that the main reason for such a behaviour is that for those operations it traverses whole shared_buffers and it seems to me that we don't need that especially for not-so-big tables. We can optimize that path by looking into buff mapping table for the pages that exist in shared_buffers for the case when table size is less than some threshold (say 25%) of shared buffers. I don't like this too much because it will fail badly if the caller is wrong about the maximum possible page number for the table, which seems not exactly far-fetched. (For instance, remember those kernel bugs we've seen that cause lseek to lie about the EOF position?) It also offers no hope of a fix for the other operations that scan the whole buffer pool, such as DROP TABLESPACE and DROP DATABASE. If that is true, then our reliance on lseek elsewhere could also cause data loss, for example by failing to scan data during a seq scan. The consequences of failure of lseek in this case are nowhere near as dire, since by definition the data is being destroyed by the user. So ISTM that we should be able to use this technique. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On Sun, Jun 28, 2015 at 9:05 PM, Tom Lane t...@sss.pgh.pa.us wrote: Amit Kapila amit.kapil...@gmail.com writes: On Sat, Jun 27, 2015 at 7:40 PM, Tom Lane t...@sss.pgh.pa.us wrote: I don't like this too much because it will fail badly if the caller is wrong about the maximum possible page number for the table, which seems not exactly far-fetched. (For instance, remember those kernel bugs we've seen that cause lseek to lie about the EOF position?)

Considering we already have an exclusive lock while doing this operation and nobody else can write to this file, wouldn't closing and opening it again avoid such problems?

On what grounds do you base that touching faith?

Sorry, but I don't understand what problem you see here.

Quite aside from outright bugs, having lock on a table has nothing to do with whether low-level processes such as the checkpointer can touch it.

I thought this problem (lseek lying about EOF) would only occur if the file has recently been extended; could mere writes to existing pages also cause it? Ideally we should take care of any failure of an OS API, especially one that could lead to data loss, but I am not sure we can do that reliably in all possible cases. Can we safely assume that, because we haven't encountered such a problem with any other API, everything is good?

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On Sun, Jun 28, 2015 at 9:47 PM, Tom Lane t...@sss.pgh.pa.us wrote: Simon Riggs si...@2ndquadrant.com writes: On 27 June 2015 at 15:10, Tom Lane t...@sss.pgh.pa.us wrote: I don't like this too much because it will fail badly if the caller is wrong about the maximum possible page number for the table, which seems not exactly far-fetched. (For instance, remember those kernel bugs we've seen that cause lseek to lie about the EOF position?)

If that is true, then our reliance on lseek elsewhere could also cause data loss, for example by failing to scan data during a seq scan.

The lseek point was a for-example, not the entire universe of possible problem sources for this patch. (Also, underestimating the EOF point in a seqscan is normally not an issue since any rows in a just-added page are by definition not visible to the scan's snapshot.

How do we ensure that a just-added page falls before or after the scan's snapshot? If it can fall before, then the point Simon raises above is valid. Does this mean that all other usages of smgrnblocks()/mdnblocks() are safe with respect to this issue, or just that the consequences would not be as bad as for this usage?

But I digress.) The consequences of failure of lseek in this case are nowhere near as dire, since by definition the data is being destroyed by the user.

I'm not sure what you consider dire, but missing a dirty buffer belonging to the to-be-destroyed table would result in the system being permanently unable to checkpoint, because attempts to write out the buffer to the no-longer-extant file would fail.

So another idea here could be that, instead of failing, we just ignore the error when the object to which the page belongs no longer exists; then we could make Drop essentially free by not invalidating shared_buffers at all for Drop/Truncate. I think this might not be a sane idea, though, as we would need a way to look up objects from the checkpointer, and we would need to handle the case where the same Oid gets assigned to a new object (after wraparound?).
With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On Sun, Jun 28, 2015 at 12:17 PM, Tom Lane t...@sss.pgh.pa.us wrote: I'm not sure what you consider dire, but missing a dirty buffer belonging to the to-be-destroyed table would result in the system being permanently unable to checkpoint, because attempts to write out the buffer to the no-longer-extant file would fail. You could only get out of the situation via a forced database crash (immediate shutdown), followed by replaying all the WAL since the time of the problem. In production contexts that could be pretty dire. Hmm, that is kind of ugly. Is the write actually going to fail, though, or is it just going to create a sparse file? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
Amit Kapila amit.kapil...@gmail.com writes: I have looked into it and found that the main reason for such a behaviour is that for those operations it traverses whole shared_buffers and it seems to me that we don't need that especially for not-so-big tables. We can optimize that path by looking into buff mapping table for the pages that exist in shared_buffers for the case when table size is less than some threshold (say 25%) of shared buffers. I don't like this too much because it will fail badly if the caller is wrong about the maximum possible page number for the table, which seems not exactly far-fetched. (For instance, remember those kernel bugs we've seen that cause lseek to lie about the EOF position?) It also offers no hope of a fix for the other operations that scan the whole buffer pool, such as DROP TABLESPACE and DROP DATABASE. In the past we've speculated about fixing the performance of these things by complicating the buffer lookup mechanism enough so that it could do "find any page for this table/tablespace/database" efficiently. Nobody's had ideas that seemed sane performance-wise though. regards, tom lane
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On Fri, Jun 26, 2015 at 9:45 PM, Amit Kapila amit.kapil...@gmail.com wrote: Sometime back on one of the PostgreSQL blogs [1], there was a discussion about the performance of drop/truncate table for large values of shared_buffers, and it seems that as the value of shared_buffers increases, the performance of drop/truncate table becomes worse. I think those are not often-used operations, so it never became a priority to look into improving them if possible. I have looked into it and found that the main reason for such behaviour is that those operations traverse the whole of shared_buffers, and it seems to me that we don't need that, especially for not-so-big tables. We can optimize that path by looking into the buffer mapping table for the pages that exist in shared_buffers, for the case when the table size is less than some threshold (say 25%) of shared buffers. The attached patch implements the above idea, and I found that performance doesn't dip much with the patch even with a large value of shared_buffers. I have also attached the script and sql file used to take performance data.

+1 for the effort to improve this. With your technique added, there are 3 possible ways the search can happen: a) scan NBuffers and scan the list of relations, b) scan NBuffers and bsearch the list of relations, and c) scan the list of relations and then invalidate the blocks of each fork from shared buffers. Would it be worth finding one technique that can serve decently from low-end shared_buffers settings to high-end ones?

On the patch: There are multiple naming styles being used in DropForkSpecificBuffers(): my_name and myName. Given this is a new function, it'd help to be consistent.

s/blk_count/blockNum/

s/new//, e.g. newTag, because there's no corresponding tag/oldTag variable in the function.

s/blocksToDel/blocksToDrop/. BTW, we never pass anything other than the total number of blocks in the fork, so we may as well call it just numBlocks.
s/traverse_buf_freelist/scan_shared_buffers/, because when it is true, we scan the whole of shared_buffers.

s/rel_count/rel_num/

Reduce the indentation/tabs in the header comments of DropForkSpecificBuffers(). But I see there's precedent in neighboring functions, so this may be okay.

Doing pfree() of num_blocks, num_fsm_blocks and num_vm_blocks in one place (instead of two, at different indentation levels) would help readability.

Best regards, -- Gurjeet Singh http://gurjeet.singh.im/
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On Sat, Jun 27, 2015 at 8:06 PM, Gurjeet Singh gurj...@singh.im wrote: On Fri, Jun 26, 2015 at 9:45 PM, Amit Kapila amit.kapil...@gmail.com wrote: Attached patch implements the above idea and I found that performance doesn't dip much with patch even with large value of shared_buffers. I have also attached script and sql file used to take performance data.

+1 for the effort to improve this.

Thanks.

With your technique added, there are 3 possible ways the search can happen: a) scan NBuffers and scan the list of relations, b) scan NBuffers and bsearch the list of relations, and c) scan the list of relations and then invalidate the blocks of each fork from shared buffers. Would it be worth finding one technique that can serve decently from low-end shared_buffers to high-end?

Yes, it would be great if we could find such a technique, but in the absence of one the current optimization should suffice, unless there are loopholes in it.

On the patch: There are multiple naming styles being used in DropForkSpecificBuffers(): my_name and myName. Given this is a new function, it'd help to be consistent.

Thanks for the suggestions; I will improve the patch along those lines if there is no problem with the idea at a broad level.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On Sat, Jun 27, 2015 at 7:40 PM, Tom Lane t...@sss.pgh.pa.us wrote: Amit Kapila amit.kapil...@gmail.com writes: I have looked into it and found that the main reason for such a behaviour is that for those operations it traverses the whole of shared_buffers, and it seems to me that we don't need that, especially for not-so-big tables. We can optimize that path by looking into the buffer mapping table for the pages that exist in shared_buffers, for the case when the table size is less than some threshold (say 25%) of shared buffers.

I don't like this too much because it will fail badly if the caller is wrong about the maximum possible page number for the table, which seems not exactly far-fetched. (For instance, remember those kernel bugs we've seen that cause lseek to lie about the EOF position?)

Considering we already have an exclusive lock while doing this operation and nobody else can write to this file, wouldn't closing and opening it again avoid such problems? In the patch that is already done (smgrexists()) for two kinds of forks, and it can be done for the third kind as well.

It also offers no hope of a fix for the other operations that scan the whole buffer pool, such as DROP TABLESPACE and DROP DATABASE.

True, but that doesn't foreclose on anybody's ideas for optimizing those paths.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com