Re: [HACKERS] Patch Review: Bugfix for XPATH() if text or attribute nodes are selected
On Thu, 14 Jul 2011 15:15:33 +0300, Peter Eisentraut wrote:

On sön, 2011-07-10 at 11:40 -0700, Josh Berkus wrote: Hackers, "B. 6. Current behaviour _is intended_ (there is an if to check the node type) and _natural_. In this particular case the user asks for the text content of some node, and this content is actually <." "I don't buy that. The check for the node type is there because two different libxml functions are used to convert nodes to strings. The if has absolutely *zero* to do with escaping, except for that missing escape_xml() call in the else case. Secondly, there is little point in having a type XML if we don't actually ensure that values of that type can only contain well-formed XML." Can anyone else weigh in on this? Peter?

Looks like a good change to me.

I'll bump it up in a few hours, as I can't recall the password for my keyring. Now my hands are clean and it's no longer my business to care about this. Best regards, Radek.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch Review: Bugfix for XPATH() if text or attribute nodes are selected
On Tue, 12 Jul 2011 11:11:40 -0700, Josh Berkus wrote: Radosław, "For me this discussion is over. I put in my objections and suggestions. The full review is available in the archives, and the reason not to escape is given in my review of your 2nd patch, the one about scalar values." Did you install and test the functionality of the patch? I can't tell from your review whether you got that far.

Yes, the patch was applied and tested. I think in both reviews I noted that the patch does what Florian said it does. The examples of double escaping were taken from one of the patches (I think the one about scalar values). If needed, I will bump these patches up. Regards, Radosław Smogura
Re: [HACKERS] Patch Review: Bugfix for XPATH() if text or attribute nodes are selected
On Sun, 10 Jul 2011 17:06:22 -0500, Robert Haas wrote: On Jul 10, 2011, at 1:40 PM, Josh Berkus j...@agliodbs.com wrote: Hackers, "B. 6. Current behaviour _is intended_ (there is an if to check the node type) and _natural_. In this particular case the user asks for the text content of some node, and this content is actually <." "I don't buy that. The check for the node type is there because two different libxml functions are used to convert nodes to strings. The if has absolutely *zero* to do with escaping, except for that missing escape_xml() call in the else case. Secondly, there is little point in having a type XML if we don't actually ensure that values of that type can only contain well-formed XML." Can anyone else weigh in on this? Peter?

Unless I am missing something, Florian is clearly correct here. ...Robert

For me he is not, because this should be fixed internally by making the xml type safe; currently the xml type may be used to keep proper XML and any other kind of data as well. If I ask, by any means, select xpath(/text(...)), I want to get text.

1) How should I de-escape a node in a client application (if it's part of an xml I don't have the header)? Bear in mind XML must support streaming processing too.

2) Why should I treat text() differently than a select from varchar? In both cases I ask for xml; the driver can't handle this, because it doesn't know whether it gets a scalar, text, comment, element, or maybe a document.

3) What about current applications? Folks probably use this and are happy they get text, and will not see that the next release of PostgreSQL will break their applications.

There is of course a disadvantage of the current behaviour, as it may lead to inserting malformed XML (in one case), but I created an example where auto-escaping will create double-escaped XML, and may lead to inserting improper data (this is about the 2nd patch, where Florian adds escaping, too).
SELECT XMLELEMENT(name root, XMLATTRIBUTES(foo.namespace AS sth))
FROM (SELECT (XPATH('namespace-uri(/*)', x))[1] AS namespace
      FROM (VALUES (XMLELEMENT(name root, XMLATTRIBUTES('<n' AS xmlns, 'v' AS value), 't'))) v(x)) AS foo;

       xmlelement
-------------------------
 <root sth="&amp;lt;n"/>

It can't be resolved without storing a type in xml, or adding xmltext, or adding a pseudo xmlany element which would be returned by xpath. Regards, Radek
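Radosław's question (1) — how a client is supposed to de-escape a text() result — has a mechanical answer in any client stack with an XML parser. A minimal Python sketch (the helper name is illustrative, not part of any PostgreSQL driver):

```python
import xml.etree.ElementTree as ET

def xpath_text_to_string(escaped_text):
    """Recover the raw string from an XPATH() text-node result that the
    server returned XML-escaped: wrap it in a dummy element and let the
    parser decode the entities."""
    return ET.fromstring('<x>' + escaped_text + '</x>').text
```

This is the flip side of the escaping argument: if the value is consistently escaped, a stock parser round-trips it; if it is sometimes raw and sometimes escaped, no parser can tell the cases apart.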
Re: [HACKERS] Patch Review: Bugfix for XPATH() if text or attribute nodes are selected
On Tue, 12 Jul 2011 11:45:59 +0200, Florian Pflug wrote: On Jul 12, 2011, at 11:00, Radosław Smogura wrote: On Sun, 10 Jul 2011 17:06:22 -0500, Robert Haas wrote: "Unless I am missing something, Florian is clearly correct here."

"For me he is not, because this should be fixed internally by making the xml type safe"

Huh?? Making the xml type safe is *exactly* what I'm trying to do here...

"currently the xml type may be used to keep proper XML and any other kind of data as well"

As I pointed out before, that simply isn't true. Try storing non-well-formed data into an XML column (there *are* ways to do that, i.e. there are bugs, one of which I'm trying to fix here!) and then dump and (try to) reload your database. Ka-boom!

"If I ask, by any means, select xpath(/text(...)), I want to get text."

And I want '3' || '4' to return the integer 34. Tough luck! The fact that XPATH() is declared to return XML, *not* TEXT, means you don't get what you want. Period. Feel free to provide a patch that adds a function XPATH_TEXT if you feel this is an issue. XML *is* *not* simply an alias for TEXT! It's a distinct type, with its own distinct rules about what constitutes a valid value and what doesn't.

"1) How should I de-escape a node in a client application (if it's part of an xml I don't have the header)? Bear in mind XML must support streaming processing too."

Huh?

"2) Why should I treat text() differently than a select from varchar? In both cases I ask for xml; the driver can't handle this, because it doesn't know whether it gets a scalar, text, comment, element, or maybe a document. 3) What about current applications? Folks probably use this and are happy they get text, and will not see that the next release of PostgreSQL will break their applications."

That, and *only* that, I recognize as a valid concern. However, and *again* as I have pointed out multiple times before, backwards compatibility is no excuse not to fix bugs.
Plus, there might just as well be applications which feed the contents of XML columns directly into an XML parser (as they have every right to!) and don't expect that parser to throw an error. Which, as it stands, we cannot guarantee. Having to deal with an error there is akin to having to deal with integer columns containing 'foobar'!

"Bugs must be resolved in a smart way, especially if they change behaviour, with consideration of the impact the change will produce; removing support for xml resolves this bug as well. I've said the problem should be resolved in a different way. There is of course a disadvantage of the current behaviour, as it may lead to inserting malformed XML (in one case), but I created an example where auto-escaping will create double-escaped XML, and may lead to inserting improper data (this is about the 2nd patch, where Florian adds escaping, too). SELECT XMLELEMENT(name root, XMLATTRIBUTES(foo.namespace AS sth)) FROM (SELECT (XPATH('namespace-uri(/*)', x))[1] AS namespace FROM (VALUES (XMLELEMENT(name root, XMLATTRIBUTES('<n' AS xmlns, 'v' AS value), 't'))) v(x)) AS foo; --> <root sth="&amp;lt;n"/>"

Radosław, you've raised that point before, and I refuted it. The crucial difference is that double-escaped values are well-formed, whereas un-escaped ones are not. Again, as I said before, the double-escaping done by XMLATTRIBUTES there is not pretty. But it's *not* XPATH()'s fault! To see that, simply replace your XPATH() expression with '&lt;n'::xml.

"And in fact it can't be resolved without storing a type in xml, or adding xmltext, or adding a pseudo xmlany element which would be returned by xpath."

Huh? Frankly, Radosław, I get the feeling that you're not trying to understand my answers to your objections, but instead keep repeating the same assertions over and over again. Even though at least some of them, like XML being able to store arbitrary values, are simply wrong! And I'm getting pretty tired of this...
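Florian's distinction — double-escaped output is ugly but well-formed, while unescaped output is outright broken — can be checked with any off-the-shelf parser. A small Python illustration (a sketch, not PostgreSQL code):

```python
import xml.etree.ElementTree as ET

def is_well_formed(fragment):
    """True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False
```

The double-escaped attribute from the example above parses fine; the unescaped variant is rejected by the parser.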
So far, you also don't seem to have taken a single look at the actual implementation of the patch, even though code review is supposed to be an integral part of the patch review process. I therefore don't believe that we're getting anywhere here.

So far, you don't know whether I have taken a single look; your suspicions are wrong, and you try to blame me. All of your sentences about not understanding I may send back to you, and blame you with your own words.

So please either start reviewing the actual implementation, and leave the considerations about whether we want this or not to the eventual committer. Or, if you don't want to do that for one reason or another, please consider letting somebody else take over this review, i.e. consider removing your name from the Reviewer field.

If I do a review I may put in my comments, but "I get the feeling that you're not trying to understand my answers to your objections, but instead keep repeating the same assertions over and over again" - and in the patch review there is a review of the code.
Re: [HACKERS] Crash dumps
Craig Ringer cr...@postnewspapers.com.au, Thursday 07 of July 2011 01:05:48: On 6/07/2011 11:00 PM, Radosław Smogura wrote: "I think IPC for fast shutdown of all backends and waiting for report processing is quite enough." How do you propose to make that reliable, though? -- Craig Ringer POST Newspapers 276 Onslow Rd, Shenton Park Ph: 08 9381 3088 Fax: 08 9388 2258 ABN: 50 008 917 717 http://www.postnewspapers.com.au/

I want to add an IPC layer to PostgreSQL; a few approaches may be considered:

1. System V IPC
2. Urgent data on a socket
3. Sockets (at least descriptors) + threads
4. Not portable: fork in the segfault handler (I think the forked process would start inside the segfault handler too)

I actually think 3, sockets (on Linux, pipes) + threads, will be the best and most portable. For each backend PostgreSQL would open two channels, urgent and normal; for each channel a thread would be spawned that just waits for data, and a backend would not start if it hasn't connected to the postmaster. Some security must be accounted for when opening a plain socket. In the context of a crash, the segfault handler sends information on the urgent channel, and the postmaster kills all backends except the sender, giving it time to work inside the segfault handler. Normal channels may be used for scheduling some async operations, like reading the next n blocks when a sequential scan has started. By the way, getting reports on segfault isn't something unusual; your favorite software, the Java(tm) Virtual Machine, does it. Regards, Radosław Smogura
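Option 3 (sockets + threads) can be sketched with a socketpair per channel: the postmaster-side thread blocks on the urgent channel and reacts the moment the backend's signal handler writes a byte. A minimal Python sketch under those assumptions (none of these names exist in PostgreSQL):

```python
import socket
import threading

def spawn_crash_listener(on_crash):
    """Postmaster side: create an 'urgent' channel for one backend plus a
    thread that blocks on it; any byte received means the backend is
    crashing.  Returns the backend's end of the channel."""
    postmaster_end, backend_end = socket.socketpair()

    def listen():
        if postmaster_end.recv(1):   # blocks until the backend writes
            on_crash()

    threading.Thread(target=listen, daemon=True).start()
    return backend_end

# Backend side: from the segfault handler, a single send() suffices --
# write() on a socket descriptor is async-signal-safe on POSIX systems.
```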
Re: [HACKERS] Crash dumps
On Wed, 06 Jul 2011 07:59:12 +0800, Craig Ringer wrote: On 5/07/2011 9:05 PM, Magnus Hagander wrote: On Tue, Jul 5, 2011 at 15:02, Robert Haas robertmh...@gmail.com wrote: On Mon, Jul 4, 2011 at 12:47 PM, Radosław Smogura rsmog...@softperience.eu wrote: "I asked about crash reports because at the time there was a thread about crashes on a live system." Yeah, I thought this was the result of that effort: [snip] That crash dump is basically the windows equivalent of a coredump, though. Just a different name...

Yup, it's a cut-down core dump. In this case generated in-process by the crashing backend. It'd be nice to be able to generate the crash dump from out-of-process. Unfortunately, the automatic crash dump generation system on Windows doesn't appear to be available to system services running non-interactively. Not that I could see, anyway. As a result we had to trap the crashes within the crashing process and generate the dump from there. As previously stated, doing anything within a segfaulting process is unreliable, so it's not the best approach in the world. All I was saying in this thread is that it'd be nice to have a way for a crashing backend to request that another process capture diagnostic information from it before it exits with a fault, so it doesn't have to try to dump itself. As Tom said, though, anything like that is more likely to decrease the reliability of the overall system. You don't want a dead backend hanging around forever waiting for the postmaster to act on it, and you *really* don't want other backends still alive and potentially writing to shm that's in who-knows-what state while the postmaster is busy fiddling with a crashed backend. So, overall, I think "dump a simple core and die as quickly as possible" is the best option. That's how it already works on UNIX, and all the win32 crash dump patches do is make it work on Windows too.
Personally I will not send a core dump to anyone; a core dump may contain sensitive information not only from the postmaster, but from other applications too. Btw, I just took a core dump from the postmaster, and I found in it some DNS addresses I had connected to earlier from bash. The postmaster should not see those. I think IPC for fast shutdown of all backends and waiting for report processing is quite enough. Regards, Radosław Smogura
Re: [HACKERS] Crash dumps
On Mon, 04 Jul 2011 12:58:46 +0800, Craig Ringer wrote: On 15/06/2011 2:37 AM, Radosław Smogura wrote: "Hello, because I work a little bit on the streaming protocol, from time to time I have crashes. I want to ask if you want crash reporting (this is one of the minor by-products from my mmap playing); what I have there is mmapped areas, call stacks, and some other stuff."

Core files already contain all that, don't they? They omit shared memory segments by default on most platforms, but should otherwise be quite complete. The usual approach on UNIXes and Linux is to use the built-in OS features to generate a core dump of a crashing process, then analyze it after the fact. That way the crash is over as fast as possible and you can get services back up and running before spending the time, CPU and I/O required to analyze the core dump.

Actually, what I was thinking about was to add dumping of GUCs, etc. The list of mappings dates from when I tried to mmap PostgreSQL: due to the many errors, which sometimes occurred in unexpected places, I needed to add something that would be useful for me and easy to analyse (I could simply find a pointer, and then check which region failed). The idea to try to evolve this came later. I think a report should look like:

This is a crash report of PostgreSQL database, generated on
Here is the list of GUC variables:
Here is the list of files:
Here is the backtrace:
Here is the detailed backtrace:
Here is the list of m-mappings (you may see which libraries are linked in):
Here is your free memory:
Here is your disk usage:
Here is your custom addition:

This basic report works for Linux with gdb, but there is a pluggable architecture which connects with the segfault handler.

Which process does the debugging? Does the crashing process fork() a copy of gdb to debug itself?
One thing I've been interested in is giving the postmaster (or more likely a helper for the postmaster) the ability to handle "backend is crashing" messages, attach a debugger to the crashing backend and generate a dump and/or backtrace. This might be workable in cases where in-process debugging can't be done due to a smashed stack, a full heap causing malloc() failure, etc.

Currently I do everything in the segfault handler (no fork), but I like the idea of forking (in the segfault handler); this may resolve some problems. Regards, Radosław Smogura
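As a data point for how little a crash handler can safely do in-process, CPython's faulthandler module follows exactly the pattern discussed here: it pre-opens its output file, and inside the SIGSEGV handler it only issues async-signal-safe write() calls. A sketch of wiring it up (illustrative only, unrelated to the PostgreSQL patches discussed):

```python
import faulthandler

def install_crash_reporter(path):
    """Arrange for a stack dump to be written to `path` if the process
    receives SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL.  The file is
    opened up front so the handler itself never has to allocate."""
    f = open(path, "w")
    faulthandler.enable(file=f)
    return f
```

The design choice worth noting is the same one Tom and Craig argue for: everything expensive happens before or after the crash, never inside the failing process's handler.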
Re: [HACKERS] Crash dumps
Tom Lane t...@sss.pgh.pa.us, Monday 04 of July 2011 16:32:32: Craig Ringer cr...@postnewspapers.com.au writes: "Why not produce a tool that watches the datadir for core files and processes them? ..." By and large, our attitude has been that Postgres shouldn't be crashing often enough to make this sort of infrastructure worthwhile. Developer time spent on it would be far better spent on fixing the bugs instead. "For that reason, it'd be handy if a backend could trap SIGSEGV and reliably tell the postmaster 'I'm crashing!' so the postmaster could fork a helper to capture any additional info the backend needs to be alive for. ... Unfortunately, 'reliably' and 'segfault' don't go together." Yeah. I think there's no chance at all that we'd accept patches pointed in this direction. They'd be more likely to decrease the system's reliability than anything else. Aside from the difficulty of doing anything at all reliably in an already-failing process, once we realize that something is badly wrong it's important to kill off all other backends ASAP. That reduces the window for any possible corruption of shared memory to make it into on-disk state. So interposing a helper to fool around with the failed process doesn't sound good at all. In practice I think you can generally get everything of interest out of the core file, so it's not clear to me that there's any win available from this line of thought anyhow. regards, tom lane

I asked about crash reports because at the time there was a thread about crashes on a live system. There is one win, I think: users will send a crash report sooner than a core dump (for many reasons: size, amount of confidential information, etc.). Such a report may be quite useful for reasonable bug finding. It might be shipped as a contrib module too, but I noticed, and we agree, that report generation should not affect the speed of shutdown. PostgreSQL will look better - a server application that generates crash reports looks better than one that doesn't.
Just a bit of marketing ;) Regards, Radek.
[HACKERS] Re: Review of patch Bugfix for XPATH() if expression returns a scalar value
On Wed, 29 Jun 2011 22:26:39 +0200, Florian Pflug wrote: On Jun 29, 2011, at 19:57, Radosław Smogura wrote: "This is a review of patch https://commitfest.postgresql.org/action/patch_view?id=565 Bugfix for XPATH() if expression returns a scalar value. SELECT XMLELEMENT(name root, XMLATTRIBUTES(foo.namespace AS sth)) FROM (SELECT (XPATH('namespace-uri(/*)', x))[1] AS namespace FROM (VALUES (XMLELEMENT(name root, XMLATTRIBUTES('<n' AS xmlns, 'v' AS value), 't'))) v(x)) AS foo; --> <root sth="&amp;lt;n"/> It's clearly visible that the value from the attribute is &lt;n, not <n. Every parser will read this as &lt;n, which is not natural and will require the consumer/developer to de-escape it on his side - roughly speaking, this will be reported as a serious bug."

There's a problem there, no doubt, but your proposed solution just moves the problem around. Here's your query, reformatted to be more legible:

SELECT XMLELEMENT(name root,
                  XMLATTRIBUTES(foo.namespace AS sth))
FROM (
  SELECT (XPATH('namespace-uri(/*)', x))[1] AS namespace
  FROM (VALUES (XMLELEMENT(name root,
                           XMLATTRIBUTES('<n' AS xmlns, 'v' AS value),
                           't'))) v(x)
) AS foo;

What happens is that the XPATH expression selects the xmlns attribute with the value '<n'. To be well-formed xml, that value must be escaped, so what is actually returned by the XPATH call is '&lt;n'. The XMLATTRIBUTES() function, however, doesn't distinguish between input of type XML and input of type TEXT, so it figures it has to represent the *textual* value '&lt;n' in xml, and produces '&amp;lt;n'.

Not escaping textual values returned by XPATH isn't a solution, though. For example, assume someone inserts the value returned by the XPATH() call in your query into a column of type XML. If XPATH() returned '<n' literally instead of escaping it, the column would contain non-well-formed XML in a column of type XML. Not pretty, and pretty fatal during your next dump/reload cycle.
Or, if you want another example, simply remove the XMLATTRIBUTES call and use the value returned by XPATH as a child node directly:

SELECT XMLELEMENT(name root, foo.namespace)
FROM (
  SELECT (XPATH('namespace-uri(/*)', x))[1] AS namespace
  FROM (VALUES (XMLELEMENT(name root,
                           XMLATTRIBUTES('<n' AS xmlns, 'v' AS value),
                           't'))) v(x)
) AS foo;

     xmlelement
--------------------
 <root>&lt;n</root>

Note that this is correct! The root node contains a text node representing the textual value '<n'. If XPATH() hadn't returned the value escaped, the query above would have produced <root><n</root>, which is obviously wrong. It works in this case because XMLELEMENT is smart enough to distinguish between child nodes given as TEXT values (those are escaped) and child nodes provided as XML values (those are inserted unchanged). XMLATTRIBUTES, however, doesn't make that distinction, and simply escapes all attribute values independent of their type. Now, the reason it does that is probably that *not* all XML values are well-formed attribute values without additional escaping. For example, you cannot have a literal '<' or '"' in attribute values. So if XMLATTRIBUTES, like XMLELEMENT, never escaped values which are already of type XML, the following piece of code

XMLELEMENT(name a, XMLATTRIBUTES('<node/>'::xml AS v))

would produce <a v="<node/>"/>, which, alas, is not well-formed :-( The correct solution, I believe, is to teach XMLATTRIBUTES to only escape those characters which are invalid in attribute values but valid in XML (probably only '<' and '"', but I'll check). If there are no objections, I'll prepare a separate patch for that. I don't want to include it in this patch because it's really a distinct issue (it even modifies a different function). best regards, Florian Pflug

You may manually enter invalid xml too; you don't need xpath for this.
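Florian's proposed fix — escape only what is illegal inside an attribute value, leaving already-escaped XML alone — can be contrasted with the full escaping applied to TEXT input. A hedged Python sketch of the two rules (this mirrors the idea under discussion, not the actual xml.c code):

```python
from xml.sax.saxutils import escape

def escape_text_value(text):
    """What full escaping of a textual value does: &, < and > all
    become entities.  Applied to already-escaped XML, this is exactly
    the double-escaping complained about in the thread."""
    return escape(text)

def escape_attr_of_xml(xml_value):
    """Florian's proposal for input that is *already* well-formed XML:
    replace only the characters illegal inside a double-quoted attribute
    value ('<' and '"'); '&' is left alone because it is already part
    of a valid entity reference in well-formed XML."""
    return xml_value.replace('<', '&lt;').replace('"', '&quot;')
```

Applied to the '&lt;n' returned by XPATH(), the first function double-escapes while the second leaves it intact, yet the second still renders a value like '<node/>'::xml attribute-safe.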
Much more: in PostgreSQL 9.0.1, compiled by Visual C++ build 1500, 64-bit, I executed SELECT XMLELEMENT(name a, XMLATTRIBUTES('<node/>'::xml AS v)) and it looks like I got the correct output, <a v="&lt;node/&gt;"/>; when I looked into the table files with a text editor I saw the same value. I will check on the latest master whether I can reproduce this. But indeed, right now PostgreSQL is not type safe with respect to the XML type. See SELECT XMLELEMENT(name root, '<', (xpath('//text()', '<root>&lt;</root>'))[1]). You found a real problem with escaping, but the solution is bad; I think the problem lies with the XML type, which may be used to hold both plain strings and proper xml. For example, the above query can't distinguish whether (xpath('//text()', '<root>&lt;</root>'))[1] is xml text or an xml element. For me, adding support for scalars is really good and needed, but the escaping is not. Regards, Radosław Smogura
[HACKERS] Patch Review: Bugfix for XPATH() if text or attribute nodes are selected
Review of patch https://commitfest.postgresql.org/action/patch_view?id=580

=== Patch description ===

SUMMARY: When a text()-based XPATH expression is invoked, the output is not XML-escaped.

DESCRIPTION: Submitter invokes the following statement: SELECT (XPATH('/*/text()', '<root>&lt;</root>'))[1]. He expects the (escaped) result &lt;, but gets <.

AFFECTS: Possibly this affects situations where a user wants to insert output from the above expression into an XML column.

PATCH CONTENT:
A. 1. Patch fixes the above problem (I don't treat this as a problem, but as an enhancement).
A. 2. In addition the patch contains test cases for the above.
A. 3. Patch changes the behaviour of the function xml_xmlnodetoxmltype, invoked by xpath_internal, by adding an escape_xml() call.
A. 4. I don't see any stability issues with this.
A. 5. Performance may be reduced and memory consumption may increase due to the internals of escape_xml.

=== Review ===

B. 1. Patch applies cleanly.
B. 2. Source compiles, and the patch works as the Submitter wants.
B. 3. Personally I missed notes in the documentation that such expressions will be escaped (those should be clearly exposed, as live functionality is changed).
B. 4. Submitter possibly revealed some missed, minor functionality of PostgreSQL XML, as he expects XML-escaped output.
B. 5. Currently XPATH produces escaped output for Element nodes, and unescaped output for all other types of nodes (including text, comment, etc.).
B. 6. Current behaviour _is intended_ (there is an if to check the node type) and _natural_. In this particular case the user asks for the text content of some node, and this content is actually <.
B. 7. Similar behaviour may be observed e.g. in SQL Server(tm): SELECT x.value('(//text())[1]', 'varchar(256)') FROM #xml_t produced unescaped output.
B. 8. Even if the current behaviour may be treated as wrong, it was exposed and others may depend on the unescaped content.
B. 9. I think the correct approach would be to create a new function (based on the existing one) that is able to escape the above.
In this situation the call would look like (for example only!): SELECT xml_escape((XPATH('/*/text()', '<root>&lt;</root>')))[1] or SELECT xml_escape((XPATH('/*/text()', '<root>&lt;</root>'))[1]) - one function to operate on an array, one to operate on a single XML datum. Or think about something like xmltext(). (Compare the current status of xml.c.)

B. 10. Whether such a function (B.9.) is needed and whether it will be included is out of scope of this review.

Basing mainly on A.1, B.6. and B.8., and bearing in mind B.10., in my opinion this is subject to rejection as "needs more work", or as invalid. The detailed explanation of why such behaviour should not be implemented I will send in the review of https://commitfest.postgresql.org/action/patch_view?id=565. Regards, Radosław Smogura

P.S. I would like to say sorry for such a late answer, but I sent this from another mail address, which was not subscribed to the mailing list. Blame KDE KMail, not me :)
[HACKERS] Review of patch Bugfix for XPATH() if expression returns a scalar value
This is a review of patch https://commitfest.postgresql.org/action/patch_view?id=565 Bugfix for XPATH() if expression returns a scalar value.

Patch applies cleanly, and compiles cleanly too; I didn't check the tests. From the discussion about the patch, and the thread referenced in it, http://archives.postgresql.org/pgsql-general/2010-07/msg00355.php, if I understand correctly, such functionality is desired. This patch, I think, takes a good approach to dealing with scalar values, and converting a value to its string representation is good too (in line with the current support for xml), with one exception detailed below.

In this patch the submitter, similarly to https://commitfest.postgresql.org/action/patch_view?id=580, added functionality for XML-escaping of some kinds of values (I think only string scalars are escaped), which is not natural and may lead to double escaping in some situations. An example query:

SELECT XMLELEMENT(name root, XMLATTRIBUTES(foo.namespace AS sth))
FROM (SELECT (XPATH('namespace-uri(/*)', x))[1] AS namespace
      FROM (VALUES (XMLELEMENT(name root, XMLATTRIBUTES('<n' AS xmlns, 'v' AS value), 't'))) v(x)) AS foo;

       xmlelement
-------------------------
 <root sth="&amp;lt;n"/>

It's clearly visible that the value from the attribute is &lt;n, not <n. Every parser will read this as &lt;n, which is not natural and will require the consumer/developer to de-escape it on his side - roughly speaking, this will be reported as a serious bug. I didn't find good reasons why XML-escaping should be included; the submitter wrote about inserting this into an xml column and the possibility of data damage, but didn't give an example. Such an example is desired.

Conclusion: I vote +1 for this patch if the escaping is removed. Regards, Radoslaw Smogura
Re: [HACKERS] Hugetables question
Martijn van Oosterhout klep...@svana.org, Thursday 23 of June 2011 09:10:20: On Wed, Jun 22, 2011 at 02:31:01PM +0200, Radosław Smogura wrote: "I strictly disagree with the opinion that if it's 1% it's worthless. 1% here, 1% there, and finally you get 10%; of course hugepages will work quite well if used in code that requires many random jumps. I think this can be reproduced, and some uncommon case may be found that gets a performance benefit of about 10% (maybe upload a whole table into shared buffers and randomly peek at records one by one)." I think the point is not that 1% is worthless, but that it hasn't been shown that it *is* a 1% improvement, because the noise is too large. For benefits this small, what you need to do is run each test 100 times and check the mean and standard deviation, and see whether the improvement is real or not. When the benefit is 10% you only need a handful of runs to prove it, which is why such patches are accepted more easily. Have a nice day. "Patriotism is when love of your own people comes first; nationalism, when hate for people other than your own comes first." - Charles de Gaulle

I think the conclusion from this test was "there are much more important things to do than a 1% benefit" - not "1% is worthless". I will try hugepages today, with random peeks. Regards, Radek
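Martijn's suggestion — many runs, then compare the gain against the measured noise — is mechanical to implement. A sketch (the 2-sigma threshold is an arbitrary assumption, not something from the thread):

```python
from statistics import mean, stdev

def is_real_improvement(base_tps, patched_tps, sigmas=2.0):
    """Return True only if the mean TPS gain exceeds `sigmas` times the
    standard error of the difference between the two sample means."""
    gain = mean(patched_tps) - mean(base_tps)
    stderr = (stdev(base_tps) ** 2 / len(base_tps)
              + stdev(patched_tps) ** 2 / len(patched_tps)) ** 0.5
    return gain > sigmas * stderr
```

With Tom's numbers (roughly 3200 vs 3160 TPS, noise above 1%) and only a handful of runs, such a test would not call the gain significant — which is exactly Martijn's point.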
Re: [HACKERS] Hugetables question
On Wed, 22 Jun 2011 14:24:17 +0300, Marti Raudsepp wrote: On Sun, Jun 19, 2011 at 12:56, Radosław Smogura rsmog...@softperience.eu wrote: "I want to implement hugepages for shared memory" Hi, have you read this post by Tom Lane about the performance estimation, with a proof-of-concept hugepages patch? http://archives.postgresql.org/pgsql-hackers/2010-11/msg01842.php It's possible that there was a flaw in his analysis, but his conclusion is that it's not worth it: "And the bottom line is: if there's any performance benefit at all, it's on the order of 1%. The best result I got was about 3200 TPS with hugepages, and about 3160 without. The noise in these numbers is more than 1% though." Regards, Marti

Actually, when I tried to implement hugepages for palloc (I was unable to write a fast and effective allocator), my result was that with normal pages I saw a small performance degradation, but with huge pages it was faster than the normal build (even with my ineffective allocator). I know there are some problems with accessing higher memory (when the server has more than 8GB), and hugepages may resolve this. I strictly disagree with the opinion that if it's 1% it's worthless. 1% here, 1% there, and finally you get 10%; of course hugepages will work quite well if used in code that requires many random jumps. I think this can be reproduced, and some uncommon case may be found that gets a performance benefit of about 10% (maybe upload a whole table into shared buffers and randomly peek at records one by one). Regards, Radek
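The "use huge pages if available, otherwise fall back to normal pages" behaviour under discussion can be sketched at the mmap level. A hedged Python illustration — MAP_HUGETLB's value (0x40000) is the Linux constant and is an assumption here, since the mmap module does not expose it on all versions/platforms, and the huge-page attempt simply fails where no huge pages are configured:

```python
import mmap

# Linux flag value; an assumption -- not exposed by the mmap module
# on all Python versions/platforms.
MAP_HUGETLB = getattr(mmap, "MAP_HUGETLB", 0x40000)

def alloc_shared(size):
    """Try an anonymous shared mapping backed by huge pages; fall back
    to normal pages if the kernel refuses (e.g. no huge pages reserved).
    Returns (buffer, used_huge_pages)."""
    flags = mmap.MAP_SHARED | mmap.MAP_ANONYMOUS
    try:
        return mmap.mmap(-1, size, flags=flags | MAP_HUGETLB), True
    except (OSError, ValueError):
        return mmap.mmap(-1, size, flags=flags), False
```

For the huge-page attempt to ever succeed, size must be a multiple of the huge page size (typically 2 MiB) and huge pages must be reserved, e.g. via vm.nr_hugepages.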
Re: [HACKERS] Patch - Debug builds without optimization
Greg Stark st...@mit.edu, Monday 20 of June 2011 03:39:12: On Thu, Jun 16, 2011 at 9:00 PM, Tom Lane t...@sss.pgh.pa.us wrote: "BTW, if you're hacking Postgres code and don't already have a reinstall script, you need one. Mine is basically:

pg_ctl stop
cd $PGBLDROOT/src/backend
make install-bin
pg_ctl start"

I've always wondered what other people do to iterate quickly. It's a bit of a pain that you can't just run the binary out of the build tree. This looks a lot safer than some of the things I was considering doing with symlinks.

I actually go to the installation directory and call, in one line (simple, because the up arrow helps):

pg_ctl -D db stop; gmake -C ../postgresql -j5 install; pg_ctl -D db start

Regards, Radek
[HACKERS] POSIX question
Hello, I had an idea about hugepages, and I read why PostgreSQL doesn't support POSIX shared memory (the need for nattach). While reading about POSIX/SysV I found this (a thread about dynamically chunked shared memory): http://archives.postgresql.org/pgsql-hackers/2010-08/msg00586.php While playing with mmap I worked out an approach for dealing with growing files, so maybe this approach could resolve both of the above problems (POSIX and dynamic shared memory). Here is the idea: 1. mmap a large amount of anonymous virtual memory (this will be the maximum size of shared memory). 2. Init a small SysV chunk for the shmem header (to keep the nattach-based safety check). 3. SysV remap is Linux-specific, so unmap the first few VM pages of step 1 and attach the segment from step 2 there. 3a. Lock the header when adding chunks (the 1st chunk is the header); we don't want concurrent chunk allocation. 4. Allocate other chunks of shared memory (POSIX is the best way), record them in the shmem header with information like chunk id/name, whether the chunk is POSIX or SysV, and any flags (hugepage?) needed for reattaching, and attach them inside the region from step 1. 4b. Unlock from 3a. Step 1 will not eat memory, as memory allocation is delayed, and on 64-bit platforms you may reserve quite a huge region; in the future it may be possible, using mmap/munmap, to concatenate chunks, defragment them, etc. mmap guarantees that mapping with MAP_FIXED over an already mapped area will unmap the old mapping. A working preview changeset applied to sysv_shmem.c may be quite small. If someone wants to extend memory, they may add a new chunk (of course, to keep the header memory contiguous, the number of chunks is limited). What do you think about this? Regards, Radek -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] POSIX question
Florian Pflug f...@phlo.org Monday 20 of June 2011 16:16:58 On Jun20, 2011, at 15:27 , Radosław Smogura wrote: 1. mmap some large amount of anonymous virtual memory (this will be the maximum size of shared memory). ... Step 1 will not eat memory, as memory allocation is delayed, and on 64-bit platforms you may reserve quite a huge chunk of this; in the future it may be possible, using mmap / munmap, to concatenate chunks, defragment them, etc. I think this breaks with strict overcommit settings (i.e. vm.overcommit_memory = 2 on linux). To fix that, you'd need a way to tell the kernel (or glibc) to simply reserve a chunk of virtual address space for further use. Not sure if there's an API for that... best regards, Florian Pflug This may be achieved by many other means, like mmapping /dev/null. Regards, Radek -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] POSIX question
Florian Pflug f...@phlo.org Monday 20 of June 2011 17:01:40 On Jun20, 2011, at 16:39 , Radosław Smogura wrote: Florian Pflug f...@phlo.org Monday 20 of June 2011 16:16:58 On Jun20, 2011, at 15:27 , Radosław Smogura wrote: 1. mmap some large amount of anonymous virtual memory (this will be the maximum size of shared memory). ... Step 1 will not eat memory, as memory allocation is delayed, and on 64-bit platforms you may reserve quite a huge chunk of this; in the future it may be possible, using mmap / munmap, to concatenate chunks, defragment them, etc. I think this breaks with strict overcommit settings (i.e. vm.overcommit_memory = 2 on linux). To fix that, you'd need a way to tell the kernel (or glibc) to simply reserve a chunk of virtual address space for further use. Not sure if there's an API for that... best regards, Florian Pflug This may be achieved by many other means, like mmapping /dev/null. Are you sure? Isn't mmap()ing /dev/null a way to *allocate* memory? Or at least this is what I always thought glibc does when you malloc() a large block at once. (This allows it to actually return the memory to the kernel once you free() it, which isn't possible if the memory was allocated simply by extending the heap.) You can work around this by mmap()ing an actual file, because then the kernel knows it can use the file as backing store and thus doesn't need to reserve actual physical memory. (In a way, this just adds additional swap space.) Doesn't seem very clean though... Even if there's a way to work around a strict overcommit setting, unless the workaround is a syscall *explicitly* designed for that purpose, I'd be very careful with using it. You might just as well be exploiting a bug in the overcommit accounting logic, and future kernel versions may simply choose to fix the bug... best regards, Florian Pflug I'm 99% sure. When I was playing with mmap I preallocated probably about 100GB of memory. 
Regards, Radek -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] POSIX question
Florian Pflug f...@phlo.org Monday 20 of June 2011 17:07:55 On Jun20, 2011, at 17:05 , Radosław Smogura wrote: I'm 99% sure. When I was playing with mmap I preallocated probably about 100GB of memory. You need to set vm.overcommit_memory to 2 to see the difference. Did you do that? You can do that either with echo 2 > /proc/sys/vm/overcommit_memory or by editing /etc/sysctl.conf and issuing sysctl -p. best regards, Florian Pflug I've just created a 127TB mapping in Linux - the maximum allowed by the VM - trying overcommit with 0, 1, and 2. Regards, Radek -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Hugetables question
I want to implement hugepages for shared memory. To make it transparent I want to do it in this fashion: 1. Reserve memory M of size s. 2. Try to allocate hugepage memory of as big a size as possible (hs) and attach it at M. 3. Allocate normal shared memory of size s - hs, and attach it at M+hs. This solution should work for Linux and Windows, and makes no difference to how the application uses such shared memory. (...and this actually works.) But in sysv_shmem.c I saw some checking for memory belonging to other (probably failed) processes. Because I can't put a new header in step 3, I would like to ask whether it would be sufficient to: 1. Check the creator pid with shmctl. 2. Remove the check of the shmem magic. 3. Or maybe instead it would be better to split the shared memory: the header would be stored under one key and would contain keys to the other individually allocated blocks? Of course moving to POSIX may be much better, but according to the commitfest thread it may be impossible. Maybe there are some other ideas. Regards, Radek -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hugetables question
Martijn van Oosterhout klep...@svana.org Sunday 19 of June 2011 12:35:18 On Sun, Jun 19, 2011 at 11:56:15AM +0200, Radosław Smogura wrote: I want to implement hugepages for shared memory, to make it transparent I want to do it in this fashion: 1. Reserve memory M of size s 2. Try to allocate hugepage memory of as big a size as possible (hs), attach at M. 3. Allocate normal shared memory of size s - hs, and attach it at M+hs. This solution should work for Linux and Windows, and make no difference for usage of such shared memory in the application. At least in Linux they're trying to make hugepages transparent, so I'm wondering if this is going to make a difference for Linux in the long term. As for your other problem, perhaps you can put the shmem block first, before the hugemem block? Would require some pointer fiddling, but seems doable. Have a nice day, Patriotism is when love of your own people comes first; nationalism, when hate for people other than your own comes first. - Charles de Gaulle Yes, hugepages will be transparent in Linux, but in any case transparent hugepages currently apply only to anonymous memory, and they have some disadvantages over explicit hugepages. Regards, Radek -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] XPATH evaluation
Nicolas Barbier nicolas.barb...@gmail.com Friday 17 of June 2011 17:29:57 2011/6/17, Andrew Dunstan and...@dunslane.net: On 06/17/2011 10:55 AM, Radosław Smogura wrote: XML canonicalization preserves whitespace, if I remember well; I think there is an example. In any case, if I store an image in XML (I've seen this), preservation of whitespace and newlines is important. If you store images you should encode them anyway, in base64 or hex. Whitespace that is not at certain obviously irrelevant places (such as right after a tag name, between attributes, outside of the whole document, etc.), and that is not defined to be irrelevant by some schema (if the parser is schema-aware), is relevant. You cannot just muck around with it and consider that correct. More generally, data that needs that sort of preservation should possibly be in CDATA nodes. CDATA sections are just syntactic sugar (a form of escaping): URL:http://www.w3.org/TR/xml-infoset/#omitted Appendix D: What is not in the Information Set [..] 19. The boundaries of CDATA marked sections. Therefore, there is no such thing as a CDATA node that would be different from just text (Infoset-wise). Note that that does not mean that binary data is never supposed to be altered or that all binary data is to be accepted: e.g., whether newlines are represented using \n, \r, or \r\n is irrelevant; also, binary data that is not valid according to the used encoding must of course not be accepted. Nicolas I would like to send a patch to remove the formatting. How to deal with the collapsing of blank nodes, I don't know. 
Regards, Radek diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c index 702b9e3..07464fb 100644 --- a/src/backend/utils/adt/xml.c +++ b/src/backend/utils/adt/xml.c @@ -3262,7 +3262,7 @@ xml_xmlnodetoxmltype(xmlNodePtr cur) buf = xmlBufferCreate(); PG_TRY(); { - xmlNodeDump(buf, NULL, cur, 0, 1); + xmlNodeDump(buf, NULL, cur, 0, 0); result = xmlBuffer_to_xmltype(buf); } PG_CATCH(); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] XPATH evaluation
Hello, During review of https://commitfest.postgresql.org/action/patch_view?id=580 I found the following problems with XPath. 1. SELECT (XPATH('/root/*', '<root xmlns:o="http://olacle.com/db" xmlns:p="http://postgresql.org/db"><o:db><a><b></b></a></o:db><p:db></p:db></root>')); Produces: {<o:db> <a> <b/> </a> </o:db>,<p:db/>} In the above, <b></b> was reduced to <b/>; this is a different infoset than the input, and those notations are interpreted differently, e.g. by XML-binding web services: the first may be mapped to an empty string, and the second to null. The result was also formatted, which again produces a different infoset. Both of the above may cause problems with XML digesting. 2. SELECT (XPATH('/root/*', '<root xmlns:o="http://olacle.com/db" xmlns:p="http://postgresql.org/db"><o:db></o:db><p:db></p:db></root>')); Produces: {<o:db/>,<p:db/>} In the above, the namespaces are missing. I may take on the first (the fix is simple), but about the second I have mixed feelings. I think the second should transfer the namespaces to the client in some way. What do you think? Regards, Radosław Smogura -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] XPATH evaluation
Andrew Dunstan and...@dunslane.net Friday 17 of June 2011 15:47:04 On 06/17/2011 05:41 AM, Florian Pflug wrote: On Jun17, 2011, at 11:09 , Radosław Smogura wrote: 1. SELECT (XPATH('/root/*', '<root xmlns:o="http://olacle.com/db" xmlns:p="http://postgresql.org/db"><o:db><a><b></b></a></o:db><p:db></p:db></root>')); Produces: {<o:db> <a> <b/> </a> </o:db>,<p:db/>} In the above, <b></b> was reduced to <b/>; this is a different infoset than the input, and those notations are interpreted differently, e.g. by XML-binding web services: the first may be mapped to an empty string, and the second to null. Oh, joy :-( I thought these were basically supposed to be the same. The XML Information Set for example specifically excludes: The difference between the two forms of an empty element: <foo/> and <foo></foo>. See http://www.w3.org/TR/2004/REC-xml-infoset-20040204/ Appendix D. Note that this implies that <foo></foo> does not have content of an empty string, but that it has no content. cheers andrew Indeed, the Infoset spec and the XML canonicalization spec treat <foo></foo> the same as <foo/> - my mistake - but XML canonicalization preserves whitespace, if I remember well; I think there is an example. In any case, if I store an image in XML (I've seen this), preservation of whitespace and newlines is important. Regards, Radek. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] XPATH evaluation
Florian Pflug f...@phlo.org Friday 17 of June 2011 11:41:08 On Jun17, 2011, at 11:09 , Radosław Smogura wrote: 1. SELECT (XPATH('/root/*', '<root xmlns:o="http://olacle.com/db" xmlns:p="http://postgresql.org/db"><o:db><a><b></b></a></o:db><p:db></p:db></root>')); Produces: {<o:db> <a> <b/> </a> </o:db>,<p:db/>} In the above, <b></b> was reduced to <b/>; this is a different infoset than the input, and those notations are interpreted differently, e.g. by XML-binding web services: the first may be mapped to an empty string, and the second to null. Oh, joy :-( Does this happen only with my patch applied or also with unpatched HEAD? 2. SELECT (XPATH('/root/*', '<root xmlns:o="http://olacle.com/db" xmlns:p="http://postgresql.org/db"><o:db></o:db><p:db></p:db></root>')); Produces: {<o:db/>,<p:db/>} In the above, the namespaces are missing. Hm, that's a hard problem, I think. Your problem (1) basically tells us that ideally we'd return the matching parts of an XML document unmodified. Now, (2) tells us that isn't the most sensible thing to do either. I may take on the first (the fix is simple) What's your proposed fix for (1)? , but about the second I have mixed feelings. I think the second should transfer the namespaces to the client in some way. I don't see how XPATH() can do that without breaking its API. The only thing we could do AFAICS is to define a second XPATH evaluation function which returns a list of namespace declarations (prefix and uri) for every node. best regards, Florian Pflug No, this is not specific to your patch; it was only inspired by it. Regards, Radek -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] XPATH evaluation
Andrew Dunstan and...@dunslane.net Friday 17 of June 2011 17:09:25 On 06/17/2011 10:55 AM, Radosław Smogura wrote: Andrew Dunstan and...@dunslane.net Friday 17 of June 2011 15:47:04 On 06/17/2011 05:41 AM, Florian Pflug wrote: On Jun17, 2011, at 11:09 , Radosław Smogura wrote: 1. SELECT (XPATH('/root/*', '<root xmlns:o="http://olacle.com/db" xmlns:p="http://postgresql.org/db"><o:db><a><b></b></a></o:db><p:db></p:db></root>')); Produces: {<o:db> <a> <b/> </a> </o:db>,<p:db/>} In the above, <b></b> was reduced to <b/>; this is a different infoset than the input, and those notations are interpreted differently, e.g. by XML-binding web services: the first may be mapped to an empty string, and the second to null. Oh, joy :-( I thought these were basically supposed to be the same. The XML Information Set for example specifically excludes: The difference between the two forms of an empty element: <foo/> and <foo></foo>. See http://www.w3.org/TR/2004/REC-xml-infoset-20040204/ Appendix D. Note that this implies that <foo></foo> does not have content of an empty string, but that it has no content. cheers andrew Indeed, the Infoset spec and the XML canonicalization spec treat <foo></foo> the same as <foo/> - my mistake - but XML canonicalization preserves whitespace, if I remember well; I think there is an example. In any case, if I store an image in XML (I've seen this), preservation of whitespace and newlines is important. If you store images you should encode them anyway, in base64 or hex. More generally, data that needs that sort of preservation should possibly be in CDATA nodes. cheers andrew I know this answer, and that solution is better. But in one project I created XSL-FO with the whitespace-preserve attribute; if I extracted part of such an XSL-FO document, I could destroy the output document. Those use cases don't change the fact that XPATH output does not preserve whitespace and newlines, and produces different nodes than were in the original. It is as if a regexp over varchar trimmed its result without control. 
I emphasize this because it may cause problems with XML digest algorithms, which are quite popular, and may even cause legal problems when you try to use an Advanced Signature in the European Union, as well as with other applications. With XML binding it is quite popular to interpret <foo/> as null and <foo></foo> as an empty string; the previously mentioned Infoset spec doesn't matter here. I think no formatting is a reasonable requirement for the XPATH function. Regards, Radek. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] [Streaming] Datum compression
Hello, I am thinking about the following functionality to add to streaming. Sooner or later someone will add the possibility of gzip communication. As I work on streaming, I wonder if it is possible to change the TOAST compression to GZIP/zlib; currently it uses an LZ-family algorithm (pglz). I want to make this: if the user wants to use compression during communication, the server will be able to send the compressed datum instead of doing recompression. GZIP is actually quite popular and, apart from plain Windows C, it's supported on all driver platforms, including Java and .NET. What do you think about this approach? Regards, Radek. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Patch - Debug builds without optimization
Hello, I'm sending the following patch, which disables optimization when --enable-debug is passed. It was annoying (for me, at least) that a debug build required passing CFLAGS with -O0 to get nicely traceable code. Regards, Radek diff --git a/configure.in b/configure.in old mode 100644 new mode 100755 index e6232af..3759645 --- a/configure.in +++ b/configure.in @@ -194,10 +194,10 @@ PGAC_ARG_BOOL(enable, spinlocks, yes, [do not use spinlocks]) # -# --enable-debug adds -g to compiler flags +# --enable-debug adds -g and (-O0) to compiler flags # PGAC_ARG_BOOL(enable, debug, no, - [build with debugging symbols (-g)]) + [build with debugging symbols (-g) and no optimization (-O0)]) AC_SUBST(enable_debug) # @@ -402,7 +402,7 @@ unset CFLAGS # If the user specifies something in the environment, that is used. # else: If the template file set something, that is used. # else: If coverage was enabled, don't set anything. -# else: If the compiler is GCC, then we use -O2. +# else: If the compiler is GCC, then we use -O2, unless debug is passed. # else: If the compiler is something else, then we use -O, unless debugging. if test $ac_env_CFLAGS_set = set; then @@ -412,11 +412,18 @@ elif test ${CFLAGS+set} = set; then elif test $enable_coverage = yes; then : # no optimization by default elif test $GCC = yes; then - CFLAGS=-O2 + # if the user selected debug mode, use -O0, instead of nothing equivalent to + # (-O1) + if test $enable_debug != yes; then +CFLAGS=-O0 + else +CFLAGS=-O2 + fi else - # if the user selected debug mode, don't use -O + # if the user selected debug mode, use -O0, instead of nothing equivalent to + # (-O1) if test $enable_debug != yes; then -CFLAGS=-O +CFLAGS=-O0 fi fi -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch - Debug builds without optimization
On Thu, 16 Jun 2011 14:30:27 +0200, Radosław Smogura wrote: Hello, I'm sending following patch which disables optimization when --enable-debug is passed. It was nasty (for me, at least) that debug build required passing of CFLAGS with -O0 to get nice traceable code. Regards, Radek Sorry for the mess; this is the version that should have been submitted. diff --git a/configure.in b/configure.in old mode 100644 new mode 100755 index e6232af..47edf82 --- a/configure.in +++ b/configure.in @@ -194,10 +194,10 @@ PGAC_ARG_BOOL(enable, spinlocks, yes, [do not use spinlocks]) # -# --enable-debug adds -g to compiler flags +# --enable-debug adds -g and (-O0) to compiler flags # PGAC_ARG_BOOL(enable, debug, no, - [build with debugging symbols (-g)]) + [build with debugging symbols (-g) and no optimization (-O0)]) AC_SUBST(enable_debug) # @@ -402,7 +402,7 @@ unset CFLAGS # If the user specifies something in the environment, that is used. # else: If the template file set something, that is used. # else: If coverage was enabled, don't set anything. -# else: If the compiler is GCC, then we use -O2. +# else: If the compiler is GCC, then we use -O2, unless debug is passed. # else: If the compiler is something else, then we use -O, unless debugging. if test $ac_env_CFLAGS_set = set; then @@ -412,7 +412,13 @@ elif test ${CFLAGS+set} = set; then elif test $enable_coverage = yes; then : # no optimization by default elif test $GCC = yes; then - CFLAGS=-O2 + # if the user selected debug mode, use -O0, instead of nothing equivalent to + # (-O1) + if test $enable_debug == yes; then +CFLAGS=-O0 + else +CFLAGS=-O2 + fi else # if the user selected debug mode, don't use -O if test $enable_debug != yes; then -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] [WIP] [Stream] Preview of pg_type changes
Hello, Here I would like to show the changes to pg_type and the type infrastructure for streaming. The changes are as follows: - added new columns typstreamin and typstreamout - the general contract for these: streamin is the same as receive (receive uses internal); for streamout it is (internal, type) - changes to psql help - and all the functionality for type manipulation. No streamin/streamout methods are implemented yet. As this is a WIP patch, suggestions are welcome if someone wants and has time. Regards, Radek stream_pg_type.patch.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch - Debug builds without optimization
On Thu, 16 Jun 2011 16:00:21 -0400, Tom Lane wrote: Alvaro Herrera alvhe...@commandprompt.com writes: I disagree with this change. Debug builds are very useful to have in production, and you don't want to be running -O0 there. I have found that you can use a src/Makefile.custom like this for those times when you want to debug stuff in a particular set of files: CFLAGS := $(patsubst -O2,-O0,$(CFLAGS)) Then you remove the .o files that you want to debug, and rerun make. FWIW, I only use Makefile.custom for more-or-less-permanent changes to the build behavior of a particular machine. For one-shot things like recompiling some particular file(s) at -O0, it's easier to do this: rm foo.o make PROFILE=-O0 reinstall postgres executable The makefiles automatically add PROFILE at the end of CFLAGS, so you can inject any compile flag this way --- I think the original intent was to use it to add -pg for gprof-enabled builds. But it's handy for this. BTW, if you're hacking Postgres code and don't already have a reinstall script, you need one. Mine is basically pg_ctl stop cd $PGBLDROOT/src/backend make install-bin pg_ctl start regards, tom lane Thanks. Actually I do something like the above, but it's good to know the install-bin target; before, I ran gmake -j5 install. Regards, Radek -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Crash dumps
Hello, Because I work a little on the streaming protocol, from time to time I get crashes. I want to ask whether you would like crash reporting (this is one of the minor by-products of my mmap experiments). What I have includes mmapped areas, call stacks, and some other stuff. These reports work on Linux with gdb, but there is a pluggable architecture that hooks into segfaults. One thing that should be considered is killing the other processes immediately when reporting starts (since taking a report takes some time), so some IPC will be required. I may polish this a little and send a patch for it (currently without the IPC killing of others). Regards, Radosław Smogura -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] .gitignore for some of cygwin files
On Thu, 9 Jun 2011 14:12:59 +0200, Magnus Hagander wrote: On Wed, Jun 8, 2011 at 12:34, Radosław Smogura rsmog...@softperience.eu wrote: After compilation I got big list of differences about binaries and other files. You may see it in patch. Hmm. Does this really affect just cygwin, and not also mingw? I know msvc builds out-of-tree, but mingw doesn't, iirc? WHere do the EXEs go there? Actually I ran the compilation with make in the cygwin environment, as I can't run the build on Windows 7 / Visual Studio 2010. I don't know if this is the preferred method. From the configure script: checking for ld used by GCC... /usr/i686-pc-cygwin/bin/ld.exe checking if the linker (/usr/i686-pc-cygwin/bin/ld.exe) is GNU ld... yes CC runs as GNU C (GCC) version 4.3.4 20090804 (release) 1 (i686-pc-cygwin) compiled by GNU C version 4.3.4 20090804 (release) 1, GMP version 4.3.1, MPFR version 2.4.1-p5. GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072 After compilation I got many *dll.def files and others. What's nbproject? Just configuration from some editor (NetBeans); it looks like any change in the project recreates that entry in .gitignore. As for the patch itself, we don't normally keep comments in the .gitignore files, I'm not sure if we want to start doing that... That part is not essential - I just wanted to comment what is ignored and why. Regards, Radek -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] .gitignore for some of cygwin files
On Thu, 9 Jun 2011 14:12:59 +0200, Magnus Hagander wrote: On Wed, Jun 8, 2011 at 12:34, Radosław Smogura rsmog...@softperience.eu wrote: After compilation I got big list of differences about binaries and other files. You may see it in patch. Hmm. Does this really affect just cygwin, and not also mingw? I know msvc builds out-of-tree, but mingw doesn't, iirc? WHere do the EXEs go there? What's nbproject? As for the patch itself, we don't normally keep comments in the .gitignore files, I'm not sure if we want to start doing that... Sorry for trash this is `git status` after compilation on branch # .gitignore-for-cygwin.diff # blob_20110601.patch # nbproject/ # src/backend/postgres.def # src/backend/postgres.exe # src/backend/replication/libpqwalreceiver/liblibpqwalreceiverdll.def # src/backend/snowball/libdict_snowballdll.def # src/backend/utils/adt/blob.c # src/backend/utils/mb/conversion_procs/ascii_and_mic/libascii_and_micdll.def # src/backend/utils/mb/conversion_procs/cyrillic_and_mic/libcyrillic_and_micdll.def # src/backend/utils/mb/conversion_procs/euc2004_sjis2004/libeuc2004_sjis2004dll.def # src/backend/utils/mb/conversion_procs/euc_cn_and_mic/libeuc_cn_and_micdll.def # src/backend/utils/mb/conversion_procs/euc_jp_and_sjis/libeuc_jp_and_sjisdll.def # src/backend/utils/mb/conversion_procs/euc_kr_and_mic/libeuc_kr_and_micdll.def # src/backend/utils/mb/conversion_procs/euc_tw_and_big5/libeuc_tw_and_big5dll.def # src/backend/utils/mb/conversion_procs/latin2_and_win1250/liblatin2_and_win1250dll.def # src/backend/utils/mb/conversion_procs/latin_and_mic/liblatin_and_micdll.def # src/backend/utils/mb/conversion_procs/utf8_and_ascii/libutf8_and_asciidll.def # src/backend/utils/mb/conversion_procs/utf8_and_big5/libutf8_and_big5dll.def # src/backend/utils/mb/conversion_procs/utf8_and_cyrillic/libutf8_and_cyrillicdll.def # src/backend/utils/mb/conversion_procs/utf8_and_euc2004/libutf8_and_euc2004dll.def # 
src/backend/utils/mb/conversion_procs/utf8_and_euc_cn/libutf8_and_euc_cndll.def # src/backend/utils/mb/conversion_procs/utf8_and_euc_jp/libutf8_and_euc_jpdll.def # src/backend/utils/mb/conversion_procs/utf8_and_euc_kr/libutf8_and_euc_krdll.def # src/backend/utils/mb/conversion_procs/utf8_and_euc_tw/libutf8_and_euc_twdll.def # src/backend/utils/mb/conversion_procs/utf8_and_gb18030/libutf8_and_gb18030dll.def # src/backend/utils/mb/conversion_procs/utf8_and_gbk/libutf8_and_gbkdll.def # src/backend/utils/mb/conversion_procs/utf8_and_iso8859/libutf8_and_iso8859dll.def # src/backend/utils/mb/conversion_procs/utf8_and_iso8859_1/libutf8_and_iso8859_1dll.def # src/backend/utils/mb/conversion_procs/utf8_and_johab/libutf8_and_johabdll.def # src/backend/utils/mb/conversion_procs/utf8_and_sjis/libutf8_and_sjisdll.def # src/backend/utils/mb/conversion_procs/utf8_and_sjis2004/libutf8_and_sjis2004dll.def # src/backend/utils/mb/conversion_procs/utf8_and_uhc/libutf8_and_uhcdll.def # src/backend/utils/mb/conversion_procs/utf8_and_win/libutf8_and_windll.def # src/bin/initdb/initdb.exe # src/bin/pg_basebackup/pg_basebackup.exe # src/bin/pg_config/pg_config.exe # src/bin/pg_controldata/pg_controldata.exe # src/bin/pg_ctl/pg_ctl.exe # src/bin/pg_dump/pg_dump.exe # src/bin/pg_dump/pg_dumpall.exe # src/bin/pg_dump/pg_restore.exe # src/bin/pg_resetxlog/pg_resetxlog.exe # src/bin/psql/psql.exe # src/bin/scripts/clusterdb.exe # src/bin/scripts/createdb.exe # src/bin/scripts/createlang.exe # src/bin/scripts/createuser.exe # src/bin/scripts/dropdb.exe # src/bin/scripts/droplang.exe # src/bin/scripts/dropuser.exe # src/bin/scripts/reindexdb.exe # src/bin/scripts/vacuumdb.exe # src/include/utils/blob.h # src/interfaces/ecpg/preproc/ecpg.exe # src/pl/plpgsql/src/libplpgsqldll.def # src/test/regress/libregressdll.def # src/test/regress/pg_regress.exe # src/timezone/zic.exe -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: 
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] .gitignore for some of cygwin files
Magnus Hagander mag...@hagander.net Thursday 09 of June 2011 16:25:07

On Thu, Jun 9, 2011 at 15:40, Radosław Smogura rsmog...@softperience.eu wrote: On Thu, 9 Jun 2011 14:12:59 +0200, Magnus Hagander wrote: On Wed, Jun 8, 2011 at 12:34, Radosław Smogura rsmog...@softperience.eu wrote: After compilation I got a big list of differences involving binaries and other files. You may see it in the patch.

Hmm. Does this really affect just Cygwin, and not also MinGW? I know MSVC builds out-of-tree, but MinGW doesn't, IIRC? Where do the EXEs go there? What's nbproject? As for the patch itself, we don't normally keep comments in the .gitignore files; I'm not sure we want to start doing that...

Sorry for the noise; this is `git status` after compilation on the branch:

# .gitignore-for-cygwin.diff
# blob_20110601.patch
# nbproject/
# src/backend/postgres.def
# src/backend/postgres.exe
# src/backend/replication/libpqwalreceiver/liblibpqwalreceiverdll.def
# src/backend/snowball/libdict_snowballdll.def
# src/backend/utils/adt/blob.c
# src/backend/utils/mb/conversion_procs/ascii_and_mic/libascii_and_micdll.def
# src/backend/utils/mb/conversion_procs/cyrillic_and_mic/libcyrillic_and_micdll.def
# src/backend/utils/mb/conversion_procs/euc2004_sjis2004/libeuc2004_sjis2004dll.def
# src/backend/utils/mb/conversion_procs/euc_cn_and_mic/libeuc_cn_and_micdll.def
# src/backend/utils/mb/conversion_procs/euc_jp_and_sjis/libeuc_jp_and_sjisdll.def
# src/backend/utils/mb/conversion_procs/euc_kr_and_mic/libeuc_kr_and_micdll.def
# src/backend/utils/mb/conversion_procs/euc_tw_and_big5/libeuc_tw_and_big5dll.def
# src/backend/utils/mb/conversion_procs/latin2_and_win1250/liblatin2_and_win1250dll.def
# src/backend/utils/mb/conversion_procs/latin_and_mic/liblatin_and_micdll.def
# src/backend/utils/mb/conversion_procs/utf8_and_ascii/libutf8_and_asciidll.def
# src/backend/utils/mb/conversion_procs/utf8_and_big5/libutf8_and_big5dll.def
# src/backend/utils/mb/conversion_procs/utf8_and_cyrillic/libutf8_and_cyrillicdll.def
# src/backend/utils/mb/conversion_procs/utf8_and_euc2004/libutf8_and_euc2004dll.def
# src/backend/utils/mb/conversion_procs/utf8_and_euc_cn/libutf8_and_euc_cndll.def
# src/backend/utils/mb/conversion_procs/utf8_and_euc_jp/libutf8_and_euc_jpdll.def
# src/backend/utils/mb/conversion_procs/utf8_and_euc_kr/libutf8_and_euc_krdll.def
# src/backend/utils/mb/conversion_procs/utf8_and_euc_tw/libutf8_and_euc_twdll.def
# src/backend/utils/mb/conversion_procs/utf8_and_gb18030/libutf8_and_gb18030dll.def
# src/backend/utils/mb/conversion_procs/utf8_and_gbk/libutf8_and_gbkdll.def
# src/backend/utils/mb/conversion_procs/utf8_and_iso8859/libutf8_and_iso8859dll.def
# src/backend/utils/mb/conversion_procs/utf8_and_iso8859_1/libutf8_and_iso8859_1dll.def
# src/backend/utils/mb/conversion_procs/utf8_and_johab/libutf8_and_johabdll.def
# src/backend/utils/mb/conversion_procs/utf8_and_sjis/libutf8_and_sjisdll.def
# src/backend/utils/mb/conversion_procs/utf8_and_sjis2004/libutf8_and_sjis2004dll.def
# src/backend/utils/mb/conversion_procs/utf8_and_uhc/libutf8_and_uhcdll.def
# src/backend/utils/mb/conversion_procs/utf8_and_win/libutf8_and_windll.def
# src/bin/initdb/initdb.exe
# src/bin/pg_basebackup/pg_basebackup.exe
# src/bin/pg_config/pg_config.exe
# src/bin/pg_controldata/pg_controldata.exe
# src/bin/pg_ctl/pg_ctl.exe
# src/bin/pg_dump/pg_dump.exe
# src/bin/pg_dump/pg_dumpall.exe
# src/bin/pg_dump/pg_restore.exe
# src/bin/pg_resetxlog/pg_resetxlog.exe
# src/bin/psql/psql.exe
# src/bin/scripts/clusterdb.exe
# src/bin/scripts/createdb.exe
# src/bin/scripts/createlang.exe
# src/bin/scripts/createuser.exe
# src/bin/scripts/dropdb.exe
# src/bin/scripts/droplang.exe
# src/bin/scripts/dropuser.exe
# src/bin/scripts/reindexdb.exe
# src/bin/scripts/vacuumdb.exe
# src/include/utils/blob.h
# src/interfaces/ecpg/preproc/ecpg.exe
# src/pl/plpgsql/src/libplpgsqldll.def
# src/test/regress/libregressdll.def
# src/test/regress/pg_regress.exe
# src/timezone/zic.exe

Based on this list, a global exclude for *.exe and lib*dll.def seems reasonable. Andrew, does that seem right to you as well? I definitely want to keep the nbproject thing out of there; that belongs in a personal exclude. utils/adt/blob.c and include/utils/blob.h - what is that? That's something that's actually from your development work and should not be ignored, right?

That's mine, sorry; looks like I didn't take the diff directly against master. Regards, Radek

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org
[HACKERS] .gitignore for some of cygwin files
After compilation I got a big list of differences involving binaries and other files. You may see it in the patch. Regards, Radek

diff --git a/.gitignore b/.gitignore
old mode 100644
new mode 100755
index 81c4d5e..75e1d9a
--- a/.gitignore
+++ b/.gitignore
@@ -20,6 +20,14 @@ lcov.info
 *.vcproj
 win32ver.rc
+# Defs from gcc in Windows
+*dll.def
+/src/backend/postgres.def
+
+# Windows executables; unless you keep files with this extension in the
+# repository, a global ignore is quite enough
+*.exe
+
 
 # Local excludes in root directory
 /GNUmakefile
 /config.log
@@ -28,3 +36,4 @@ win32ver.rc
 /pgsql.sln.cache
 /Debug/
 /Release/
+/nbproject/private/
\ No newline at end of file
Re: [HACKERS] BLOB support
On Sun, 05 Jun 2011 22:16:41 +0200, Dimitri Fontaine wrote: Tom Lane t...@sss.pgh.pa.us writes: Yes. I think the appropriate problem statement is provide streaming access to large field values, as an alternative to just fetching/storing the entire value at once. I see no good reason to import the entire messy notion of LOBS/CLOBS. (The fact that other databases have done it is not a good reason.)

Spent some time in the archive to confirm a certain “déjà vu” impression. Couldn't find it. Had to manually search in closed commit fests… but here we are, I think: https://commitfest.postgresql.org/action/patch_view?id=70 http://archives.postgresql.org/message-id/17891.1246301...@sss.pgh.pa.us http://archives.postgresql.org/message-id/4a4bf87e.7010...@ak.jp.nec.com Regards,

I thought more about this in light of the references sent, but I still have in mind a construct such as:

Blob myWeddingDvd = conn.createBlob(myWeddingStream, size); // a bit outdated, we have Blu-ray now
conn.prepareStatement("INSERT INTO someonetubevideos VALUES (?)")

where the first parameter is myWeddingDvd; someone who doesn't like Java may wish to pass a C++ istream or a C FILE instead. I think (with respect to the considerations below) this implicitly requires that LOBs be stored in one centralized place, no matter whether that is the file system, a special table, or something else, because when the statement is processed there is no way to know which table the LOB will be associated with. If we want to TOAST it, where do we TOAST it? And what happens if the insertion is done by an SQL function that chooses the table depending on the BLOB content? A quite interesting idea from the cited patch was the string identifying a LOB, but with the above it closes the road for JDBC to create a LOB. I also think constructs that first insert a small LOB into a table in order to obtain some driver-dependent API are a little old-fashioned.

Possible solutions, if we don't want centralized storage:
1. Keep the BLOB in memory, but this may, depending on the implementation, limit the size of the BLOB.
2. Temporarily back the blob up in a file, then copy the file to the TOAST table when the values are stored; but some changes are still required to support LOBs for composite types and arrays.

So please give some ideas how to resolve this, or maybe it has low priority? Regards, Radek
Re: [HACKERS] BLOB support
Tom Lane t...@sss.pgh.pa.us Monday 06 of June 2011 16:13:26

Radosław Smogura rsmog...@softperience.eu writes: I thought more about this in light of the references sent, but I still have in mind a construct like Blob myWeddingDvd = conn.createBlob(myWeddingStream, size); // a bit outdated, we have Blu-ray; conn.prepareStatement("INSERT INTO someonetubevideos VALUES (?)") where the first parameter is myWeddingDvd,

Yes, if you insist upon designing the API like that, then you come to the conclusion that you need global LOB identifiers. However, there are many ways to design this that don't work that way. One idea to think about is insert into someonetubevideos values('') returning open_for_write(videocolumn) which gives you back some kind of writable stream ID (this is a transient, within-session ID, not global) for the target field in the row you just inserted.

I know, but this is a little bit old-fashioned behaviour.

BTW, as was noted upthread by Dimitri, this whole subject has been discussed before on pgsql-hackers. You really ought to go re-read the previous threads. regards, tom lane

I read them, but this may be worth rethinking again. Actually, regarding changes to TOAST (I mean streaming just for LOBs): I had written down all the disadvantages I could find of using TOAST for LOBs, and it looks like the performance of the above is the only real one, though it also rules out some less useful copy-on-write concepts for LOBs. Introducing streaming for TOAST in general is of little use; sorry, just to cite from my document mentioned earlier (this is generally about on-demand streaming of TOASTed values; in the context of LOBs it is acceptable, as long as non-transaction-aware LOBs are acceptable):

If we add streaming of TOASTed values, so that the caller gets some reference to the value, we need to ensure that the pointed-to data will not be changed or deleted. I think this will require the caller to add FOR UPDATE (or the server to add it silently) to each statement returning pointers to TOASTed values, since one client may run a transactional query while another client, just after the first, removes the record, commits, and runs VACUUM. In this situation, when the first client tries to read the data from the given row, it will get an error. This may be acceptable for LOBs (commonly LOBs are not transaction-aware), but I would be angry if this happened with VARCHAR.

If this is acceptable I will make the following changes:
- server_max_in_memory_lob_size: a GUC settable only at server start, giving the maximum value of the client session parameter max_in_memory_lob.
- max_in_memory_lob: a session GUC describing how large a LOB may be kept in memory before it is backed up to a file.
- Recursive toasting/detoasting during insert/update/remove to search for LOBs (we need this for arrays and composite types). This is for a later stage; an error disallowing LOBs in composites/arrays may be quite enough for a beginning. I want LOBs to be the starting point for LOBing other types (e.g. some big arrays may be LOBbed).
- During toasting, the LOB will be toasted and a reference to it will be put in its place, encoded in the LOB datum.
- An API for LOB manipulation (a few changes to the current implementation), such that a BLOB field is a union of in_memory, in_file, and in_toast. These few extra bytes will not noticeably affect the datum size, given the size of a LOB.
- An API for maintaining temporary LOB files; we need a per-session list of id -> file descriptor mappings, to prevent stealing of LOBs by different connections (security).
- A streaming API for TOASTed values (based on the COPY protocol, a changed COPY protocol, or at least function calls); I haven't looked at this in the context of TOASTed LOBs.

Is that good? Regards, Radek
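As a strawman, the two GUCs proposed above might look like this in postgresql.conf. To be clear, these settings do not exist; the names and defaults below are only the proposal from this mail, sketched for illustration:

```
# Hard upper bound on max_in_memory_lob; settable only at server start.
server_max_in_memory_lob_size = 64MB

# Per-session: LOBs up to this size stay in memory before being
# backed by a temporary file.
max_in_memory_lob = 8MB
```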
[HACKERS] Auto adjust send buffer size to congestion window
Hi, I've got an idea: auto-adjust the send buffer size to the size of the TCP congestion window. Would this increase performance, and in which way? I suppose throughput may be increased and latency decreased. What do you think? Regards, Radek
Re: [HACKERS] Auto adjust send buffer size to congestion window
Tom Lane t...@sss.pgh.pa.us Monday 06 of June 2011 04:07:41 Radosław Smogura rsmog...@softperience.eu writes: I've got an idea: auto-adjust the send buffer size to the size of the TCP congestion window. Would this increase performance, and in which way? I suppose throughput may be increased and latency decreased. What do you think?

I think if that makes any difference at all, it would be proof that the TCP stack on your machine was implemented by incompetents. The kernel should already be handling such considerations. regards, tom lane

I was thinking about the send buffer size in libpq. Regards, Radek
Re: [HACKERS] BLOB support
Tom Lane t...@sss.pgh.pa.us Friday 03 of June 2011 16:44:13 Alvaro Herrera alvhe...@commandprompt.com writes: Excerpts from Radosław Smogura's message of jue jun 02 15:26:29 -0400 2011: So do I understand good should We think about create bettered TOAST to support larger values then 30-bit length? I like this much more, Good :-) (BTW while it'd be good to have longer-than-30 bit length words for varlena, I'm not sure we have room for that.) You wouldn't want to push such values around as whole values anyway. Possibly what would work here is a variant form of TOAST pointer for which we'd simply throw error if you tried to fetch the entire value at once. regards, tom lane

I mentioned the JDBC calls:

1. b = conn.createBlob(); ps = conn.prepareStatement("INSERT INTO t blob = ?"); ps.setBlob(1, b); ps.executeQuery();
2. Statements could also be e.g. SELECT blob_index_of(?, ?), where the first ? is a blob and the second is some text/sequence.

These calls must be supported to make BLOBs as similar as possible to other types. This actually disallows putting the LOB in TOAST, since there is no relation (2.) or the relation is not known (1.) while reading the LOB; in any case you can't skip bytes from the protocol stream, so possible solutions are:

1. Create a temporary LOB in a file or in memory, depending on its size.
2. Use the existing large object interface.
3. The client does not send the LOB, just a faked ID; during the call the server asks the client to serialize the LOB by that faked ID.
4. Any other propositions?

I vote for 2. For the pg_type/pg_class changes, I think about removing haslobs and putting this in attstorage or somewhere else (for example 'l' may stand for LOBs, or a -3 length), but currently TOASTing of composites doesn't recurse; those types are toasted as a whole. I may add recursion for those types and special handling for LOBs. In any case, handling this will fit better in the toasting code, not in nodeModifyTable. Any ideas about this? Regards, Radek
Re: [HACKERS] BLOB support
Tom Lane t...@sss.pgh.pa.us Friday 03 of June 2011 16:44:13 Alvaro Herrera alvhe...@commandprompt.com writes: Excerpts from Radosław Smogura's message of jue jun 02 15:26:29 -0400 2011: So do I understand good should We think about create bettered TOAST to support larger values then 30-bit length? I like this much more, Good :-) (BTW while it'd be good to have longer-than-30 bit length words for varlena, I'm not sure we have room for that.) You wouldn't want to push such values around as whole values anyway. Possibly what would work here is a variant form of TOAST pointer for which we'd simply throw error if you tried to fetch the entire value at once. regards, tom lane

OK, now what you were talking about is clearer to me, but I still need to pass a constant ID to the client. Actually, this variant must be passed to the client. On the other hand, a BLOB may be created before the statement is invoked, or when it is invoked. This will require creating temporary BLOBs, and introducing a v3.1 protocol, which would allow streaming values greater than 4 GB by passing -2 in the length fields, and introducing stream_in/stream_out in pg_type (this is from my concept of a streaming protocol). So I think it will be better to introduce the streaming protocol first, as it sits underneath LOBs. I will start a thread for this in a moment.

Why? The tuples are not going away due to MVCC anyway.

Vacuum/autovacuum plus no lock may be enough, I think. A constant ID is required. Regards, Radek
[HACKERS] Streaming solution and v3.1 protocol
Hello, sorry for the short introduction to this, and please, as far as possible, keep it disconnected from LOBs, as it sits below them. The idea of streaming is to reduce memory copying, mainly while receiving and sending tuples. Currently receive works as follows:

1. Read the bytes of the tuple (allocating x memory).
2. Eventually convert them to the database encoding.
3. Use this data to create a datum (which is a slightly changed copy of 1 or 2).
4. Streaming will be allowed only in binary mode, and stream in/out will actually return binary data.

Look, for example, at the execution chain from textrecv. The idea is to add stream_in and stream_out columns to pg_type. When a value is to be serialized, sin/sout is called. The struct must pass the length of the data and a stream struct (similar to C's FILE*). The caller should validate that all bytes have been consumed (simple methods will be exposed for this).

To implement (code API requirements):
- The first stream is a buffered socket reader.
- Fast substreams: create a fast stream limited to x bytes on top of another stream.
- Skipping bytes + skipAll().
- Stream filtering: do fast encoding conversion (faster if the conversion occurs in buffered chunks).
- Support for a direct PG printf() version. Linux can create cookie streams and use them with fprintf(), so it's a great advantage for formatting huge strings; other systems should buffer the output. One problem: if a Linux cookie stream fails, will it have written something to the output already? A Windows proxy would push the value to a temporary buffer.

A good idea may be to introduce a new version of the protocol reserving length field values of -2 for fixed-size streams above 4 GB and -3 for chunked streaming; the latter is genuinely new functionality and is not required by any driver. With streaming, one may imagine the socket's fd being passed to the sin functions.

Problems: during output, something may fail while writing. Resolution: add some control flags for every n bytes sent to the client. This prevents sending e.g. 4 GB of data if the first byte failed; you send only n bytes and then an abort is received, or you send the data in frames. Regards, Radek
Re: [HACKERS] BLOB support
Tom Lane t...@sss.pgh.pa.us Friday 03 of June 2011 18:08:56 Robert Haas robertmh...@gmail.com writes: On Thu, Jun 2, 2011 at 12:53 PM, Radosław Smogura rsmog...@softperience.eu wrote: 1. No tracking of unused LOs (you store just the id of such an object). You may leak an LO after a row remove/update. A user may write triggers for this, but that is not an argument; the BLOB type is popular, and its simplicity of use is quite important. When I create an app this is the worst thing. 2. No support for casting in UPDATE/INSERT. So there is no way to simply migrate data (e.g. from too-long varchars), or to copy BLOBs. 3. Field size limited to 1 GB.

As a general point, it would probably be a good idea to address each of these issues separately, and to have a separate discussion about each one. As to #1 specifically, if you use a text or bytea field rather than a large object per se, then this issue goes away. But then you lose the streaming functionality. So at least some people here are saying that we should try to fix that by adding the streaming functionality to text/bytea rather than by doing anything to the large object facility. #2 is also a problem that only becomes a problem if you insist that LOBs have to be a distinct kind of value. regards, tom lane

And one more topic to discuss: should a blob be referenceable? E.g. I create a new Blob in JDBC and set a stream for it; what should happen if I then call UPDATE t SET b = ? WHERE 1=1? This is not about copy-on-write. Regards, Radek
Re: [HACKERS] BLOB support
On Thu, 02 Jun 2011 15:39:39 +0300, Peter Eisentraut wrote: On ons, 2011-06-01 at 22:00 +0200, Radosław Smogura wrote: I partially implemented the following missing LOB types. The requirement was to give the ability to create (B/C)LOB columns and add casting functionality, e.g. SET my_clob = 'My long text'. The idea is as follows: 0. A blob is a two-state object: in memory it contains just a bytea; serialized it contains the Oid of a large object. 1. Each type has an additional boolean haslobs, which is set recursively. 2. A relation has the same boolean haslobs (used to fast-path tables without LOBs). 3. When data are inserted/updated, a special function is called and the tuple is modified; in this way all LOBs are serialized to the (old) LOB table and just the Oid is stored. 4. On removal, the LOB is removed from the (old) LOB table.

Superficially, this looks like a reimplementation of TOAST.

It may look similar, but datums don't support more than 32-bit lengths, and the size of any field is limited to 1 GB, am I right? Serialization is only for casting simple values (below 1 GB) and for simple operations, to avoid the overhead of creating hundreds of LOBs.

What functionality exactly do you envision that the BLOB and CLOB types would need to have that would warrant treating them different from, say, bytea and text?

Actually, I had in mind less sophisticated support for LOBs: support for casting and copying data, plus the idea, known from other databases, that LOBs are not downloaded during normal query execution (just their ids are fetched). Currently, for example, LOBs are not connected with tables, so deleting rows doesn't delete the LOB (the table actually holds Oids of large objects); there is no support for casting to/from LOBs, and no support for CLOBs. Some drivers try to emulate BLOBs/CLOBs, but it is not perfect, mainly for the above reasons. Regards, Radek
Re: [HACKERS] BLOB support
Tom Lane t...@sss.pgh.pa.us Thursday 02 of June 2011 16:42:42 Robert Haas robertmh...@gmail.com writes: But these problems can be fixed without inventing a completely new system, I think. Or at least we should try. I can see the point of a data type that is really a pointer to a LOB, and the LOB gets deleted when the pointer is removed, but I don't think that should require far-reaching changes all over the system (like relhaslobs) to make it work efficiently. I think you need to start with a problem statement, get agreement that it is a problem and on what the solution should be, and then go write the code to implement that solution.

Yes. I think the appropriate problem statement is provide streaming access to large field values, as an alternative to just fetching/storing the entire value at once. I see no good reason to import the entire messy notion of LOBS/CLOBS. (The fact that other databases have done it is not a good reason.) For primitive types like text or bytea it seems pretty obvious what streaming access should entail, but it might be interesting to consider what it should mean for structured types. For instance, if I have an array field with umpteen zillion elements, it might be nice to fetch them one at a time using the streaming access mechanism. I don't say that that has to be in the first version, but it'd be a good idea to keep that in the back of your head so you don't design a dead-end solution that can't be extended in that direction. regards, tom lane

In the context of LOBs, streaming is resolved... I use the current LO functionality (so a driver is able to read LOBs as psql's \lo_export does, or using the COPY subprotocol), and the client gets just the LO's id. BLOBs in this implementation, as Robert wanted, are just a wrapper around core LO, with some extensions for special situations.

Adding relhaslobs in this implementation is quite important, so as not to examine the tuple descriptor on every table operation; the value could instead be deduced during relation open (with a performance penalty). I saw something similar done a few lines above, where triggers are fired, and a few lines below, where indexes are updated.

Currently BLOBs may be emulated using core LO (the JDBC driver does it), but among other things, the problems from the point of view of application development are:

1. No tracking of unused LOs (you store just the id of such an object). You may leak an LO after a row remove/update. A user may write triggers for this, but that is not an argument; the BLOB type is popular, and its simplicity of use is quite important. When I create an app this is the worst thing.
2. No support for casting in UPDATE/INSERT. So there is no way to simply migrate data (e.g. from too-long varchars), or to copy BLOBs.
3. Field size limited to 1 GB.

Another solution I was thinking about is to introduce system triggers (triggers that can't be disabled or removed), i.e. a new flag in the triggers table. Now I think we should try to mix both approaches, as system triggers may give an interesting API to other developers. Other databases (may) store LOBs, arrays, and composites in external tables, so the user gets just the id of such an object.

As for streaming, I will need about two weeks; I have some concepts about it, from the point of view of memory consumption and performance. I will send the concept later; I want to think a little more about it and check what can actually be done. Regards, Radek
Re: [HACKERS] BLOB support
Tom Lane t...@sss.pgh.pa.us Thursday 02 of June 2011 19:43:16 Radosław Smogura rsmog...@softperience.eu writes: Tom Lane t...@sss.pgh.pa.us Thursday 02 of June 2011 16:42:42 Yes. I think the appropriate problem statement is provide streaming access to large field values, as an alternative to just fetching/storing the entire value at once. I see no good reason to import the entire messy notion of LOBS/CLOBS. (The fact that other databases have done it is not a good reason.) In the context of LOBs, streaming is resolved... I use the current LO functionality (so a driver is able to read LOBs as psql's \lo_export does, or using the COPY subprotocol), and the client gets just the LO's id.

Just to be clear: I do not want to expose a concept of object IDs for field values in the first place. All of the problems you enumerate stem from the idea that LOBs ought to be a distinct kind of field, and I don't buy that. regards, tom lane

So do I understand correctly that we should think about creating a bettered TOAST to support values longer than the current 30-bit length allows? I like this much more, but without object IDs, querying a relation with LOBs will require locking the relation for some time, as the client will need to reference the LOB in some way; I think using the TID or some derivative of the TID, am I right? Regards, Radek
[HACKERS] BLOB support
Hello, I have partially implemented the following missing LOB types. The requirement was to give the ability to create (B/C)LOB columns and add casting functionality, e.g. SET my_clob = 'My long text'. The idea is as follows:

0. A blob is a two-state object: in memory it contains just a bytea; serialized it contains the Oid of a large object.
1. Each type has an additional boolean haslobs, which is set recursively.
2. A relation has the same boolean haslobs (used to fast-path tables without LOBs).
3. When data are inserted/updated, a special function is called and the tuple is modified; in this way all LOBs are serialized to the (old) LOB table and just the Oid is stored.
4. On removal, the LOB is removed from the (old) LOB table.

I have implemented:
- 0.
- 1. and 2., without altering relations and types.
- 3., inserts only; updates need polishing. No recursion (no support for BLOBs in composites or arrays).

The current patch is here (many changes to pg_types.h; new column haslobs added); it's in an _early_ stage, for those who want to look inside. Any ideas or suggestions? Regards, Radek

P.S. I'm in the middle of a move and have limited access to the Internet.

blob_20110601.patch.bz2 Description: application/bzip
Re: [HACKERS] Improving the memory allocator
I haven't followed this topic carefully, so sorry if I write something that has already been said, but I thought about the following approach, at least for message sending, etc.:

1. When initializing a MemoryContext (pool), pass one parameter giving a stack size. The stack is an addition to normal operations.
2. Operations on the stack are allocations only, so alloc is just a pointer bump.
3. When allocating memory for messages, internally make the size 1/3 greater (or 2/3, I don't remember well); this helps realloc when an encoding change occurs.
4. free is a no-op (as was pointed out); the stack is released, or the pointer reset, when the context is released.

From one point of view this may waste some memory (no free), but for messages that should not happen, and even giving 100 kB of memory for the stack may be quite enough, depending of course on how much data you send. Regards, Radek
Re: [HACKERS] Improving the memory allocator
On Tue, 26 Apr 2011 14:25:10 +0200, Andres Freund wrote: On Tuesday, April 26, 2011 01:59:45 PM Radosław Smogura wrote: I haven't followed this topic carefully, so sorry if I write something that has already been said, but I thought about the following approach, at least for message sending, etc.: 1. When initializing a MemoryContext (pool), pass one parameter giving a stack size. The stack is an addition to normal operations.

You can't allocate memory on the stack and return that. Andres

I didn't write "allocate on the stack". I was thinking about preallocating an n-byte chunk for stack-like memory, but without popping. Regards, Radek
Re: [HACKERS] MMAP Buffers
On Sun, 17 Apr 2011 21:06:17 -0400, Robert Haas wrote: On Sun, Apr 17, 2011 at 5:32 PM, Radosław Smogura rsmog...@softperience.eu wrote: Each process has a simple mirror of the shared descriptors. I believe modifications to buffer content may be done only while holding an exclusive lock (with some simple exceptions) (+ MVCC); actually I saw only two things that can change already-loaded data and cause the damage you described (setting hint bits during a scan, and vacuum; the 1st may only cause, I think, two processes to ask for the same transaction statuses; the 2nd is impossible, as vacuum requires an exclusive pin). When the buffer tag is changed, the buffer's version is bumped and checked against the local version; that covers reading a buffer.

Yes, an exclusive lock is required for substantive content changes. But if vacuum cleaning up the buffer is an issue for your patch, then it's probably also a problem if someone grabs an exclusive content lock and deletes the tuple (by setting XMAX) and some other backend later sees the old buffer contents after having in the meanwhile taken a new snapshot; or if likewise someone grabs an exclusive-lock, adds a tuple, and then your backend takes a new snapshot and then sees the old buffer contents. Basically, any time someone grabs an exclusive-lock and releases it, it's necessary for all observers to see the updated contents by the time the exclusive lock is released.

In other cases, after obtaining the lock, a check is done whether the buffer has an associated updatable buffer; if the local mirror has it too, a swap takes place.

I think this check would have to be done every time someone share-locks the buffer, which seems rather expensive.

The logic for updatable buffers is similar to shared buffers: each updatable buffer has a pin count, and an updatable buffer can't be freed while someone uses it; but in contrast to normal buffers, updatable buffers have no support for locking etc. Updatable buffers exist only on the free list, or when associated with a buffer.

I don't see how you're going to get away with removing buffer locks. They exist for a reason, and adding mmap() to the mix is going to require MORE locking, not less.

In the future, I will change the version to a shared segment id, something like the relation's oid + block, but with ids numbered continuously 1, 2, 3..., so I will be able to bypass smgr/md during reads, and the tag version check; this looks like a faster solution.

I don't understand this part at all.

To clarify my previous post: updatable buffers are implemented in shared memory, so there is no way a process has its own copy of the data. Regards, Radek.
Re: [HACKERS] MMAP Buffers
Tom Lane t...@sss.pgh.pa.us Sunday 17 April 2011 01:35:45 Radosław Smogura rsmog...@softperience.eu writes: No, no, no :) I wanted to do this, but for the above reason I skipped it. I swap VM pages; I do a remap: in the place where the shared buffer was I put the mmaped page, and in the place where the mmaped page was I put the shared page (in certain cases, which should be optimized by e.g. read-for-update; for the initial read of a page in a process I point directly at the shared buffer). It can be imagined as affecting the TLB. What I call VM swap is remapping: I don't change pointers, I change only where those pointers point in physical memory, preserving the same pointer in virtual memory. ...

Huh? Are you saying that you ask the kernel to map each individual shared buffer separately? I can't believe that's going to scale to realistic applications. regards, tom lane

No, I do mremap(mmap_buff_A, MAP_FIXED, temp); mremap(shared_buff_Y, MAP_FIXED, mmap_buff_A); mremap(temp, MAP_FIXED, shared_buff_Y). This is the additional overhead, and it may have some disadvantages. All regions (SysV / POSIX mmap) are mapped beforehand. I couldn't believe it either, but as I did some work on reads, I was in a dead corner:

1. Create a read-before buffer (connected with XLOG) that stores each page before modification (the page should be flushed and synced to the log), or
2. rewrite the whole DB to repoint pointers or similar stuff (I did a few steps towards this), or
3. find something different.

I couldn't believe it either; that's why I still work on it. I saw it gains speed for a few simple updates. I'm not quite sure why. I can only think it was from the pre-update reads. Full checks will come after a good number of updates. Regards, Radek
Re: [HACKERS] MMAP Buffers
Tom Lane t...@sss.pgh.pa.us Sunday 17 April 2011 17:48:56 Radosław Smogura rsmog...@softperience.eu writes: Tom Lane t...@sss.pgh.pa.us Sunday 17 April 2011 01:35:45 ... Huh? Are you saying that you ask the kernel to map each individual shared buffer separately? I can't believe that's going to scale to realistic applications. No, I do mremap(mmap_buff_A, MAP_FIXED, temp); mremap(shared_buff_Y, MAP_FIXED, mmap_buff_A); mremap(temp, MAP_FIXED, shared_buff_Y). There's no mremap() in the Single Unix Spec, nor on my ancient HPUX box, nor on my quite-up-to-date OS X box. The Linux man page for it says This call is Linux-specific, and should not be used in programs intended to be portable. So if the patch is dependent on that call, it's dead on arrival from a portability standpoint. Good point. This comes from the initial concept; I actually did it this way so as not to leave gaps in the VM into which a library or something else could get mmapped. Lately I have been thinking about using mmap to replace just one VM page. But in any case, you didn't explain how use of mremap() avoids the problem of the kernel having to maintain a separate page-mapping-table entry for each individual buffer. (Per process, yet.) If that's what's happening, it's going to be a significant performance penalty as well as (I suspect) a serious constraint on how many buffers can be managed. regards, tom lane The kernel merges vm_structs, so mappings are compacted. I'm not a kernel specialist, but leaving aside the memory consumption of uncompacted mappings, the kernel uses btrees for dealing with the TLB, so it should not matter whether there are 100 vm_structs or 10 vm_structs. The swap isn't made everywhere. When a buffer is initially read (privaterefcount == 1), any access to this buffer will point directly to the latest valid area; if it has an assigned shmem area then that will be used. I plan to add a read-buffer-for-update call to prevent swaps when it's almost certain that the buffer will be used for an update. 
I measured the performance of page modifications (with unpinning, the full process, in a stand-alone unit test): it takes 2x-3x the time of normal page reads, but this result may not be certain, as I saw that memcpy to memory above 2GB is slower than memcpy to the first 2GB (this may be an idea: try to put some shared structs below 2GB). I know that this patch is a big question mark. Sometimes I'm optimistic, and sometimes I'm pessimistic, about the final result. Regards, Radek
Re: [HACKERS] MMAP Buffers
Andres Freund and...@anarazel.de Sunday 17 April 2011 20:02:11 On Sunday 17 April 2011 19:26:31 Radosław Smogura wrote: The kernel merges vm_structs, so mappings are compacted. I'm not a kernel specialist, but leaving aside memory consumption for uncompacted mappings, the kernel uses btrees for dealing with the TLB, so it should not matter whether there are 100 vm_structs or 10 vm_structs. But the CPU's TLB cache has maybe 16/256 (1st level, 2nd level) to 64/512 entries. That will mean that there will be cache misses all over. Additionally your scheme requires flushing it regularly... Andres I only know the Phenom has 4096 entries, I think, and this covers 16MB of memory. But I was talking about the memory usage of struct vm_struct in the kernel. I tried with huge pages as well, but I couldn't write a really fast allocator for them; it's slower than malloc, maybe for different reasons. Regards, Radek
Re: [HACKERS] MMAP Buffers
Robert Haas robertmh...@gmail.com Sunday 17 April 2011 22:01:55 On Sun, Apr 17, 2011 at 11:48 AM, Tom Lane t...@sss.pgh.pa.us wrote: Radosław Smogura rsmog...@softperience.eu writes: Tom Lane t...@sss.pgh.pa.us Sunday 17 April 2011 01:35:45 ... Huh? Are you saying that you ask the kernel to map each individual shared buffer separately? I can't believe that's going to scale to realistic applications. No, I do mremap(mmap_buff_A, MAP_FIXED, temp); mremap(shared_buff_Y, MAP_FIXED, mmap_buff_A); mremap(temp, MAP_FIXED, shared_buff_Y). There's no mremap() in the Single Unix Spec, nor on my ancient HPUX box, nor on my quite-up-to-date OS X box. The Linux man page for it says This call is Linux-specific, and should not be used in programs intended to be portable. So if the patch is dependent on that call, it's dead on arrival from a portability standpoint. But in any case, you didn't explain how use of mremap() avoids the problem of the kernel having to maintain a separate page-mapping-table entry for each individual buffer. (Per process, yet.) If that's what's happening, it's going to be a significant performance penalty as well as (I suspect) a serious constraint on how many buffers can be managed. I share your suspicions, although there's no harm in measuring it. But what I don't understand is how this approach avoids the problem of different processes seeing different buffer contents. If backend A has the buffer mmap'd and backend B wants to modify it (and changes the mapping), backend A is still looking at the old buffer contents, isn't it? And then things go boom. Each process has a simple mirror of the shared descriptors. 
I believe that modifications to buffer content may only be done while holding an exclusive lock (with some simple exceptions) (+ MVCC). Actually I saw only two things that can change already loaded data and cause the damage you have described: setting hint bits during a scan, and vacuum. The 1st may only cause, I think, two processes to ask for the same transaction statuses; the 2nd is impossible, as vacuum requires an exclusive pin. When the buffer tag is changed the version of the buffer is bumped up and checked against the local version; this is about reading a buffer. In other cases, after obtaining the lock a check is done whether the buffer has an associated updatable buffer; if the local mirror has it too, then a swap should take place. The logic for updatable buffers is similar to shared buffers: each updatable buffer has a pin count, and an updatable buffer can't be freed while someone uses it, but in contrast to normal buffers, updatable buffers don't have any support for locking etc. Updatable buffers exist only on the free list, or when associated with a buffer. In future, I will change the version to a shared segment id, something like relation's oid + block, but the ids will have continuous numbering 1, 2, 3..., so I will be able to bypass smgr/md during reads and the tag version check; this looks like a faster solution. Regards, Radek
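The tag-version check described above can be sketched as follows. This is a hypothetical illustration with invented names (SharedDesc, LocalMirror, and both functions are mine, not the patch's structures): the shared descriptor bumps a version on every tag change, and each backend compares it against the version it last mirrored before trusting its cached mapping.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical sketch of the tag-version check: the shared descriptor
 * bumps a version every time its tag changes; each backend keeps the
 * version it last saw and re-validates its local mapping on mismatch. */
typedef struct
{
    unsigned    rel;
    unsigned    block;
} BufTag;

typedef struct
{
    BufTag      tag;
    atomic_uint tag_version;    /* bumped on every tag change */
} SharedDesc;

typedef struct
{
    unsigned    seen_version;   /* shared version at last validation */
    void       *mapped_at;      /* backend-local pointer for the buffer */
} LocalMirror;

/* Returns true when the local mirror is stale and the backend must
 * re-establish its mapping (the "swap" in the mail above); also
 * records the version just seen so the next check is cheap. */
static bool
mirror_is_stale(SharedDesc *sd, LocalMirror *lm)
{
    unsigned    v = atomic_load(&sd->tag_version);

    if (v == lm->seen_version)
        return false;           /* tag unchanged since last look */
    lm->seen_version = v;
    return true;
}

/* Shared side: retag the buffer for a new (rel, block) and bump the
 * version so every backend's next check notices the change. */
static void
shared_retag(SharedDesc *sd, BufTag newtag)
{
    sd->tag = newtag;
    atomic_fetch_add(&sd->tag_version, 1);
}
```

In the real thing the retag would happen under the buffer header lock; the sketch only shows why a backend-local version mirror lets most reads skip any shared-memory bookkeeping.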
Re: [HACKERS] MMAP Buffers
Robert Haas robertmh...@gmail.com Monday 18 April 2011 03:06:17 On Sun, Apr 17, 2011 at 5:32 PM, Radosław Smogura rsmog...@softperience.eu wrote: Each process has a simple mirror of the shared descriptors. I believe that modifications to buffer content may only be done while holding an exclusive lock (with some simple exceptions) (+ MVCC); actually I saw only two things that can change already loaded data and cause the damage you have described (setting hint bits during a scan, and vacuum; the 1st may only cause, I think, two processes to ask for the same transaction statuses, the 2nd is impossible as vacuum requires an exclusive pin). When the buffer tag is changed the version of the buffer is bumped up and checked against the local version; this is about reading a buffer. Yes, an exclusive lock is required for substantive content changes. But if vacuum cleaning up the buffer is an issue for your patch, then it's probably also a problem if someone grabs an exclusive content lock and deletes the tuple (by setting XMAX) and some other backend later sees the old buffer contents after having in the meanwhile taken a new snapshot; or if likewise someone grabs an exclusive-lock, adds a tuple, and then your backend takes a new snapshot and then sees the old buffer contents. Basically, any time someone grabs an exclusive-lock and releases it, it's necessary for all observers to see the updated contents by the time the exclusive lock is released. In other cases, after obtaining the lock a check is done whether the buffer has an associated updatable buffer and if the local mirror has it too, then a swap should take place. I think this check would have to be done every time someone share-locks the buffer, which seems rather expensive. I don't treat this as an issue, but it's a disadvantage. 
The logic for updatable buffers is similar to shared buffers: each updatable buffer has a pin count, and an updatable buffer can't be freed while someone uses it, but in contrast to normal buffers, updatable buffers don't have any support for locking etc. Updatable buffers exist only on the free list, or when associated with a buffer. I don't see how you're going to get away with removing buffer locks. They exist for a reason, and adding mmap() to the mix is going to require MORE locking, not less. In future, I will change the version to a shared segment id, something like relation's oid + block, but the ids will have continuous numbering 1, 2, 3..., so I will be able to bypass smgr/md during reads and the tag version check; this looks like a faster solution. I don't understand this part at all. Versioning is an approach from when I expected really frequent changes of the mmapped areas and allocated parts of segments; but now each segment is mmapped with reservation, to its full possible size, so the addresses of segments can't change (the only problem is segment deletion). Regards, Radek
Re: [HACKERS] MMAP Buffers
Greg Stark gsst...@mit.edu Saturday 16 April 2011 13:00:19 On Sat, Apr 16, 2011 at 7:24 AM, Robert Haas robertmh...@gmail.com wrote: The OP says that this patch maintains the WAL-before-data rule without any explanation of how it accomplishes that seemingly quite amazing feat. I assume I'm going to have to read this patch at some point to refute this assertion, and I think that sucks. I am pretty nearly 100% confident that this approach is utterly doomed, and I don't want to spend a lot of time on it unless someone can provide me with a compelling explanation of why my confidence is misplaced. Fwiw he did explain how he did that. Or at least I think he did -- it's possible I read what I expected because what he came up with is something I've recently been thinking about. What he did, I gather, is treat the mmapped buffers as a read-only copy of the data. To actually make any modifications he copies it into shared buffers and treats them like normal. When the buffers get flushed from memory they get written and then the pointers get repointed back at the mmapped copy. Effectively this means the shared buffers get extended to include all of the filesystem cache instead of having to evict buffers from shared buffers just because you want to read another one that's already in filesystem cache. It doesn't save the copying between filesystem cache and shared buffers for buffers that are actually being written to. But it does save some amount of other copies on read-only traffic and it can even save some i/o. It does require a function call before each buffer modification where the pattern is currently lock buffer, mutate buffer, mark buffer dirty. From what he describes he needs to add a prepare buffer for mutation between the lock and mutate. I think it's an interesting experiment and it's good to know how to solve some of the subproblems. Notably, how do you extend files or drop them atomically across processes? 
And how do you deal with getting the mappings to be the same across all the processes, or deal with them being different? But I don't think it's a great long-term direction. It just seems clunky to have to copy things from mmapped buffers to local buffers and back. Perhaps the performance testing will show that clunkiness is well worth it, but we'll need to see that for a wide variety of workloads to judge that. In short: I swap, exchange (clash of terms) VM pages to preserve pointers (only if needed). I tried to point directly to the new memory area, but I saw that some parts of the code really depend on the memory pointed to by the original pointers, e.g. vacuum uses hint bits set by a previous scan (it depends on whether the bit is set or not! so for it, it's not only a hint). From this one case I can't assume there are no more such places, so the VM page swap does it for me. Stand-alone tests show that this process (with the copy from mmap) takes 2x-3x the time of the previous one. But unless someone updates a whole table, benefit will come from the pre-update scans, index scans, and larger availability of memory (you don't eat cache memory to keep a copy of the cache in ShMem). Everything may be slower when the database fits in ShMem, and similarly 2nd-level buffers may increase performance slightly. I reserve memory for the whole segment even if the file is smaller. Extending is done by writing one byte at the end of the block (here may come a deal with the Unified Buffer Cache, if I remember the name well). For current processors, and the current implementation, the database size is limited to about 260TB (no dynamic segment reservation is performed). Truncation is not implemented. Each buffer descriptor has a tagVersion to simply check whether the buffer tag has changed. Descriptors are (partially) mirrored in local memory, and the versions are checked. 
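The extend-by-one-byte idea can be sketched like this (the helper name `extend_one_block` is mine, not the patch's): writing a single zero byte at the last offset of the new block grows the file, so the already-reserved mapping over that range becomes backed by real file pages instead of faulting.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Grow a file by exactly one block by writing one byte at the end of
 * the new block, as described above. Assumes the caller has already
 * mmap'ed a reservation large enough to cover the new block.
 * Illustrative sketch, not code from the patch. */
static int extend_one_block(int fd, off_t cur_blocks, size_t blcksz)
{
    off_t new_end = (off_t) (cur_blocks + 1) * (off_t) blcksz;

    /* pwrite of one zero byte at new_end - 1 extends the file to
     * new_end; the intervening bytes read back as zeros (a hole). */
    if (pwrite(fd, "", 1, new_end - 1) != 1)
        return -1;
    return 0;
}
```

The attraction is that no pre-zeroed block buffer has to be written, and the kernel can keep the skipped range as a hole until it is actually dirtied.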
Currently each re-read is pointed to smgr/md, but introducing a shared segment id, and assuming each segment has a constant maximum number of blocks, will make it faster (this will be something like the current buffer tag); even the version field will become unneeded. I saw problems with vacuum, as it reopens the relation and I got mappings of the same file twice (a minor problem). The important part will be deletion, when pointers must be invalidated in a good way. Regards, Radek.
Re: [HACKERS] MMAP Buffers
Tom Lane t...@sss.pgh.pa.us Saturday 16 April 2011 17:02:32 Greg Stark gsst...@mit.edu writes: What he did, I gather, is treat the mmapped buffers as a read-only copy of the data. To actually make any modifications he copies it into shared buffers and treats them like normal. When the buffers get flushed from memory they get written and then the pointers get repointed back at the mmapped copy. That seems much too late --- won't other processes still be looking at the stale mmap'ed version of the page until a write-out happens? No, no, no :) I wanted to do this, but for the above reason I skipped it. I swap VM pages; I do a remap: in the place where the shared buffer was I put the mmapped page, and in the place where the mmapped page was I put the shared page (in certain cases, which should be optimized by e.g. read-for-update; for the initial read of a page in a process I point directly to the shared buffer). It can be imagined as affecting the TLB. What I call VM swap is remapping, so I don't change pointers; I change only where those pointers point in physical memory, preserving the same pointer in virtual memory. If 0x1 is the start of buffer 1 (at relation 1, block 1), I have:

    before swap:
    0x1     - 0x1     + BLCKSZ - mmapped area
    0xf1000 - 0xf1000 + BLCKSZ - shmem

    after swap:
    0x1     - 0x1     + BLCKSZ - shmem
    0xf1000 - 0xf1000 + BLCKSZ - mmapped area

That's the reason I put /proc/{pid}/maps into crash reports. For example, the maps after a swap look like this (from a crash report):

    [...]
    #Data mappings
    7fe69b7e3000-7fe69b7ef000 r--s 08:03 3196408 /home/radek/src/postgresql-2nd-level-cache/db/base/12822/12516
    7fe69b7ef000-7fe69b7f1000 rw-s 00148000 00:04 8880132 /SYSV0052ea91 (deleted)
    7fe69b7f1000-7fe6db7e3000 r--s e000 08:03 3196408 /home/radek/src/postgresql-2nd-level-cache/db/base/12822/12516
    [...]
    #SysV shmem mappings
    7fec60788000-7fec6078c000 rw-s 00144000 00:04 8880132 /SYSV0052ea91 (deleted)
    7fec6078c000-7fec6078e000 r--s c000 08:03 3196408 /home/radek/src/postgresql-2nd-level-cache/db/base/12822/12516
    7fec6078e000-7fec6079c000 rw-s 0014a000 00:04 8880132 /SYSV0052ea91 (deleted)
    [...]

Without the swap, 12516 would be mapped to one VM region of size equal to BLCKSZ * BLOCKS_PER_SEGMENT (which is about 1GB). When a process reads a buffer (or after taking a lock), the shared buffer descriptor is checked to see whether the page was modified (currently: is it dirty); if yes, do the swap if the page is currently in use, or use the SysV shared area directly if the page is just being pinned by the process. Regards, Radek I'm pretty concerned about the memory efficiency of this too, since it seems like it's making it *guaranteed*, not just somewhat probable, that there are two copies in RAM of every database page that's been modified since the last checkpoint (or so). regards, tom lane
[HACKERS] MMAP Buffers
Hello, If I may, I want to share a concept for using mmap in PG. It's far, far away from perfect, but it keeps WAL before data. I created a table with an index, inserted a few values, and ran VACUUM FULL on this table. The DB inits fine from the original sources. Read performance (if the backend is loaded) is really good; query time goes down from 450ms to about 410ms. Updates may be slower, but work is in progress (I will start with writes, as I got to the point where simple updates may be performed). Even though I haven't covered all aspects of updating, it's simple to do: just call PrepareBufferToUpdate before modifying a buffer; of course some ideas for improving this are still in my head. Any comments and suggestions are welcome. I didn't include this as a diff here because of its ~150kb size (mainly configure scripts, which are included in the VCS). You may download it from http://softperience.eu/downloads/pg_mmap_20110415.diff.bz2 (Legal: work under the PostgreSQL BSD License). The patch is just a git diff; later I will try to set up a git repository. Regards and have a nice day, Radek. P.S. About the problem with asserts and signals I wrote about: I merged with the latest master and rebuilt the code. I think I forgot to rebuild it after the previous merge.
Re: [HACKERS] MMAP Buffers
On Fri, 15 Apr 2011 14:33:37 +0300, Heikki Linnakangas wrote: On 15.04.2011 13:32, Radosław Smogura wrote: If I may, I want to share a concept for using mmap in PG. It's far, far away from perfect, but it keeps WAL before data. I created a table with an index, inserted a few values, and ran VACUUM FULL on this table. The DB inits fine from the original sources. Read performance (if the backend is loaded) is really good; query time goes down from 450ms to about 410ms. Updates may be slower, but work is in progress (I will start with writes, as I got to the point where simple updates may be performed). Even though I haven't covered all aspects of updating, it's simple to do: just call PrepareBufferToUpdate before modifying a buffer; of course some ideas for improving this are still in my head. Any comments and suggestions are welcome. The patch is quite hard to read because of random whitespace changes and other stylistic issues, but I have a couple of high-level questions on the design: Yes, but, hmm... in Netbeans I had really long gaps (probably 8 spaces, from tabs), so deeper ifs, and comments at the ends of lines with variables, went out of my screen. I really wanted not to reformat this, but sometimes I needed to. * Does each process have its own mmappings, or are the mmappings done to shared buffers? Those are MAP_SHARED mappings, but each process has its own pointer to them. * How do you handle locking? I do not do locking... I do a different (worse) thing... When a buffer should be updated, it gets a shared buffer and the content is copied (so the situation is almost the same as with fread), and depending on the situation the content is used directly, or the pages between the mmapped and shared (updatable) regions are swapped; this keeps tuple pointers, etc. I really would be happy if such a method (lock flushing to file) could exist. Do you still need to allocate a shared buffer for each mmapped page? Currently each mmapped page has an additional shared buffer, but it's almost ready to use an independent pool of shared buffers. 
This will be good, as mmapped buffers could cover the whole system cache, keeping maybe 10%-20% of this size for writes in SHMEM. I have thought about MAP_PRIVATE, but both have some pluses and minuses; e.g. MAP_SHARED may, for less critical systems, more simply be equipped with a GUC mmap_direct_write=true. Regards, Radek
Re: [HACKERS] MMAP Buffers
Joshua Berkus j...@agliodbs.com Friday 15 April 2011 18:55:04 Radoslaw, 10% improvement isn't very impressive from a switch to mmap. What workload did you test with? What I'd really like to see is testing with databases which are 50%, 90% and 200% the size of RAM ... that's where I'd expect the greatest gain from limiting copying. I think 10% is quite good, as my stand-alone test of mmap vs. read showed that the speed-up of copying 100MB of data to memory may be from ~20ms to ~100ms (depending on the destination address). Of course a deeper system test simulating real usage will say more. In any case, after getting writes into good shape, I will speed up reads; I am thinking of bypassing smgr/md much more and exposing shared ids (1, 2, 3...) for each file segment. Going back to the topic... In the attachment I sent the test scripts which I used to fill the data; nothing complex (left over from the 2nd-level caches). The query I used to measure was SELECT count(substr(content, 1, 1)) FROM testcase1 WHERE multi_id < 5; Timings were taken from psql. I haven't done load tests (I have about 2GB of free space at /home, and 4GB of RAM) or stress tests (I'm not quite ready to try concurrent updates of the same page; they may fail, and a notice and the place to fix it are in the code) yet. Netbeans is possibly not very well suited to working on postgres code. AFAIK emacs and/or vi(m) are used by almost all the major developers. Guys, can we *please* focus on the patch for now, rather than the formatting, which is fixable with sed? Netbeans is quite good; of course it depends who likes what. Just try 7.0 RC 2. Regards, Radek test-scritps_20110319_0026.tar.bz2 Description: application/bzip-compressed-tar
Re: [HACKERS] 2nd Level Buffer Cache
Jim Nasby j...@nasby.net Thursday 24 March 2011 20:36:48 On Mar 23, 2011, at 5:12 PM, Tom Lane wrote: Robert Haas robertmh...@gmail.com writes: It looks like the only way anything can ever get put on the free list right now is if a relation or database is dropped. That doesn't seem too good. Why not? AIUI the free list is only for buffers that are totally dead, ie contain no info that's possibly of interest to anybody. It is *not* meant to substitute for running the clock sweep when you have to discard a live buffer. Turns out we've had this discussion before: http://archives.postgresql.org/pgsql-hackers/2010-12/msg01088.php and http://archives.postgresql.org/pgsql-hackers/2010-12/msg00689.php Tom made the point in the first one that it might be good to proactively move buffers to the freelist so that backends would normally just have to hit the freelist and not run the sweep. Unfortunately I haven't yet been able to do any performance testing of any of this... perhaps someone else can try and measure the amount of time spent by backends running the clock sweep with different shared buffer sizes. -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net Won't it be enough to take a spin lock (or use an ASM locked increment on Intel/AMD) around the increment of StrategyControl->nextVictimBuffer? Everything here may be controlled by the macro GetNextVictimBuffer(). Within the for (;;), the valid buffer may be obtained modulo NBuffers, to decrease the lock time. We may try to calculate how many buffers we have skipped and decrease e.g. trycounter by this value, and put in some additional restriction like no more passes than NBuffers*4 calls, and report an error. This would make the clock sweep concurrent. Regards, Radek
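The proposal above can be sketched with a C11 atomic standing in for the "ASM locked increment" (all names here are mine, mimicking but not taken from the real code): one fetch-and-add replaces the lock around nextVictimBuffer, with the buffer id taken modulo NBuffers.

```c
#include <assert.h>
#include <stdatomic.h>

/* Sketch of a lock-free clock-sweep advance: the victim counter only
 * ever increments atomically, and the buffer id is the counter taken
 * modulo the pool size. Illustrative, not PostgreSQL's code. */
#define NBUFFERS 128

static atomic_uint next_victim_counter;

static unsigned GetNextVictimBuffer(void)
{
    /* One atomic op replaces lock/increment/wrap/unlock. Unsigned
     * wraparound is harmless here because the ticket is only ever
     * used modulo NBUFFERS (a power of two, so 2^32 divides evenly). */
    unsigned ticket = atomic_fetch_add(&next_victim_counter, 1);
    return ticket % NBUFFERS;
}

/* completePasses need not be its own shared counter: it can be
 * derived from the same ticket counter. */
static unsigned CompletePasses(void)
{
    return atomic_load(&next_victim_counter) / NBUFFERS;
}
```

One caveat: with a plain modulo, NBuffers should divide the counter's wraparound point (e.g. be a power of two), or the sequence glitches once every 2^32 ticks; real code would need to handle that, and the pass accounting, more carefully.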
Re: [HACKERS] 2nd Level Buffer Cache
Robert Haas robertmh...@gmail.com Thursday 24 March 2011 22:41:19 On Thu, Mar 24, 2011 at 5:34 PM, Greg Stark gsst...@mit.edu wrote: On Thu, Mar 24, 2011 at 8:59 PM, Robert Haas robertmh...@gmail.com wrote: It seems at least plausible that buffer allocation could be significantly faster if it need only pop the head of a list, rather than scanning until it finds a suitable candidate. Moving as much buffer allocation work as possible into the background seems like it ought to be useful. Linked lists are notoriously non-concurrent, that's the whole reason for the clock sweep algorithm to exist at all instead of just using an LRU directly. That said, an LRU needs to be able to remove elements from the middle and not just enqueue elements on the tail, so the situation isn't exactly equivalent. Just popping off the head is simple enough but the bgwriter would need to be able to add elements to the tail of the list and the people popping elements off the head would need to compete with it for the lock on the list. And I think you need a single lock for the whole list because of the cases where the list becomes a single element or empty. The main impact this list would have is that it would presumably need some real number of buffers to satisfy the pressure for victim buffers for a real amount of time. That would represent a decrease in cache size, effectively evicting buffers from cache as if the cache were smaller by that amount. Theoretical results are that a small change in cache size affects cache hit rates substantially. I'm not sure that's born out by practical experience with Postgres though. People tend to either be doing mostly i/o or very little i/o. Cache hit rate only really matters and is likely to be affected by small changes in cache size in the space in between You wouldn't really have to reduce the effective cache size - there's logic in there to just skip to the next buffer if the first one you pull off the freelist has been reused. 
But the cache hit rates on those buffers would (you'd hope) be fairly low, since they are the ones we're about to reuse. Maybe it doesn't work out to a win, though. If I may: under abnormal circumstances (like the current process being held by the kernel), obtaining a buffer from the list may be cheaper. This code:

    while (StrategyControl->firstFreeBuffer >= 0)
    {
        buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
        Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
        /* Unconditionally remove buffer from freelist */
        StrategyControl->firstFreeBuffer = buf->freeNext;
        buf->freeNext = FREENEXT_NOT_IN_LIST;

could look like this:

    do
    {
        SpinLock();
        if (StrategyControl->firstFreeBuffer < 0)
        {
            Unspin();
            break;
        }
        buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
        Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);
        /* Unconditionally remove buffer from freelist */
        StrategyControl->firstFreeBuffer = buf->freeNext;
        buf->freeNext = FREENEXT_NOT_IN_LIST;
        Unspin();
    } while (true);

Acquiring a spin lock for the linked list is enough, and cheaper than taking an LWLock, which is more complex than a spin here. After this, similarly with a spin lock on the clock sweep:

    trycounter = NBuffers * 4;
    for (;;)
    {
        spinlock();
        buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
        if (++StrategyControl->nextVictimBuffer >= NBuffers)
        {
            StrategyControl->nextVictimBuffer = 0;
            StrategyControl->completePasses++;
        }
        unspin();
Re: [HACKERS] 2nd Level Buffer Cache
Merlin Moncure mmonc...@gmail.com Monday 21 March 2011 20:58:16 On Mon, Mar 21, 2011 at 2:08 PM, Greg Stark gsst...@mit.edu wrote: On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure mmonc...@gmail.com wrote: Can't you make just one large mapping and lock it in 8k regions? I thought the problem with mmap was not being able to detect other processes (http://www.mail-archive.com/pgsql-general@postgresql.org/msg122301.html) compatibility issues (possibly obsolete), etc. I was assuming that locking part of a mapping would force the kernel to split the mapping. It has to record the locked state somewhere so it needs a data structure that represents the size of the locked section and that would, I assume, be the mapping. It's possible the kernel would not in fact fall over too badly doing this. At some point I'll go ahead and do experiments on it. It's a bit fraught though as the performance may depend on the memory management features of the chipset. That said, that's only part of the battle. On 32bit you can't map the whole database as your database could easily be larger than your address space. I have some ideas on how to tackle that but the simplest test would be to just mmap 8kB chunks everywhere. Even on 64 bit systems you only have 48 bit address space which is not a theoretical limitation. However, at least on linux you can map in and map out pretty quick (10 microseconds paired on my linux vm) so that's not so big of a deal. Dealing with rapidly growing files is a problem. That said, probably you are not going to want to reserve multiple gigabytes in 8k non contiguous chunks. But it's worse than that. Since you're not responsible for flushing blocks to disk any longer you need some way to *unlock* a block when it's possible to be flushed. That means when you flush the xlog you have to somehow find all the blocks that might no longer need to be locked and atomically unlock them. That would require new infrastructure we don't have though it might not be too hard. 
What would be nice is a mlock_until() where you eventually issue a call to tell the kernel what point in time you've reached and it unlocks everything older than that time. I wonder if there is any reason to mlock at all... if you are going to 'do' mmap, can't you just hide under the current lock architecture for actual locking and do direct memory access without mlock? merlin I can't reproduce this. A simple test shows 2x faster reads with mmap than read(). I'm sending what I have done with mmap (really ugly, but I'm in the forest). It is a read-only solution, so init the database first with some amount of data (I have about 300MB); the 2nd-level scripts may do this for you. This is what I found: 1. If I don't require putting the new mmap (mmap with MAP_FIXED) in the previous region (I just do munmap / mmap) with each query, execution time grows by about 10%. 2. Sometimes it is enough just to comment or uncomment something that has no side effects on code flow (bufmgr.c; (un)comment some unused if; put NULL, it will be replaced), and e.g. query execution time may grow 2x. 3. My initial solution was 2% faster, about 9ms, when reading; now it's 10% slower, after making things more usable. Regards, Radek pg_mmap_20110323.patch.bz2 Description: application/bzip
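For reference, the two read paths being compared can be sketched as follows. This is a minimal stand-in for the stand-alone test mentioned above, not the actual test program: both helpers return the same bytes, and the difference under measurement is the kernel copy in read() versus the mapping setup plus the munmap() cost discussed in this thread.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fetch len bytes from the start of path with read():
 * the kernel copies from page cache into dst. */
static int read_via_read(const char *path, char *dst, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, dst, len);
    close(fd);
    return n == (ssize_t) len ? 0 : -1;
}

/* Fetch the same bytes through a MAP_SHARED | PROT_READ mapping:
 * page-cache pages are mapped in, then copied out once. */
static int read_via_mmap(const char *path, char *dst, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping survives the close */
    if (p == MAP_FAILED)
        return -1;
    memcpy(dst, p, len);
    munmap(p, len);             /* the costly part, per the post above */
    return 0;
}
```

A real benchmark would loop these over large files and time them separately; the sketch only pins down what "mmap read" and "read() read" mean in the comparison.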
Re: [HACKERS] 2nd Level Buffer Cache
Merlin Moncure mmonc...@gmail.com Tuesday 22 March 2011 23:06:02 On Tue, Mar 22, 2011 at 4:28 PM, Radosław Smogura rsmog...@softperience.eu wrote: Merlin Moncure mmonc...@gmail.com Monday 21 March 2011 20:58:16 On Mon, Mar 21, 2011 at 2:08 PM, Greg Stark gsst...@mit.edu wrote: On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure mmonc...@gmail.com wrote: Can't you make just one large mapping and lock it in 8k regions? I thought the problem with mmap was not being able to detect other processes (http://www.mail-archive.com/pgsql-general@postgresql.org/msg122301.html) compatibility issues (possibly obsolete), etc. I was assuming that locking part of a mapping would force the kernel to split the mapping. It has to record the locked state somewhere so it needs a data structure that represents the size of the locked section and that would, I assume, be the mapping. It's possible the kernel would not in fact fall over too badly doing this. At some point I'll go ahead and do experiments on it. It's a bit fraught though as the performance may depend on the memory management features of the chipset. That said, that's only part of the battle. On 32bit you can't map the whole database as your database could easily be larger than your address space. I have some ideas on how to tackle that but the simplest test would be to just mmap 8kB chunks everywhere. Even on 64 bit systems you only have 48 bit address space which is not a theoretical limitation. However, at least on linux you can map in and map out pretty quick (10 microseconds paired on my linux vm) so that's not so big of a deal. Dealing with rapidly growing files is a problem. That said, probably you are not going to want to reserve multiple gigabytes in 8k non contiguous chunks. But it's worse than that. Since you're not responsible for flushing blocks to disk any longer you need some way to *unlock* a block when it's possible to be flushed. 
That means when you flush the xlog you have to somehow find all the blocks that might no longer need to be locked and atomically unlock them. That would require new infrastructure we don't have though it might not be too hard. What would be nice is a mlock_until() where you eventually issue a call to tell the kernel what point in time you've reached and it unlocks everything older than that time. I wonder if there is any reason to mlock at all...if you are going to 'do' mmap, can't you just hide under current lock architecture for actual locking and do direct memory access without mlock? merlin Actually, after dealing with mmap and adding munmap, I found the crucial reason not to use mmap: you need to munmap, and for me this takes a lot of time, even if I only read with MAP_SHARED | PROT_READ. It looks like Linux does a flush or something else; the same happens with MAP_FIXED, MAP_PRIVATE, etc. Can you produce a small program demonstrating the problem? This is not how things should work AIUI. I was thinking about playing with an mmap implementation of the clog system -- it's perhaps a better fit. clog is a rigidly defined size, and has very high performance requirements. Also it's much less change than reimplementing heap buffering, and maybe not so much affected by munmap. merlin Ah... just one thing that may be useful for understanding why performance is lost with huge memory: I saw that the mmaped buffers are allocated at something like 0x007, so definitely above 4GB.
Re: [HACKERS] 2nd Level Buffer Cache
Greg Stark gsst...@mit.edu Wednesday 23 March 2011 21:30:04 On Wed, Mar 23, 2011 at 8:00 PM, Robert Haas robertmh...@gmail.com wrote: It looks like the only way anything can ever get put on the free list right now is if a relation or database is dropped. That doesn't seem too good. I wonder if the background writer shouldn't be trying to maintain the free list. That is, perhaps BgBufferSync() should notice when the number of free buffers drops below some threshold, and run the clock sweep enough to get it back up to that threshold. I think this is just a terminology discrepancy. In postgres the free list is only used for buffers that contain no useful data at all. The only time there are buffers on the free list is at startup or if a relation or database is dropped. Most of the time blocks are read into buffers that already contain other data. Candidate buffers to evict are buffers that have been used least recently. That's what the clock sweep implements. What the bgwriter's responsible for is looking at the buffers *ahead* of the clock sweep and flushing them to disk. They stay in ram and don't go on the free list, all that changes is that they're clean and therefore can be reused without having to do any i/o. I'm a bit skeptical that this works because as soon as bgwriter saturates the i/o the os will throttle the rate at which it can write. When that happens even a few dozens of milliseconds will be plenty to allow the purely user-space processes consuming the buffers to catch up instantly. But Greg Smith has done a lot of work tuning the bgwriter so that it is at least useful in some circumstances. I could well see it being useful for systems where latency matters and the i/o is not saturated. Freelist is almost useless under normal operations, but it's only one check if it's empty or not, which could be optimized by checking (... -1), or !(... 
0) Regards, Radek
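The clock-sweep eviction Greg describes can be sketched as follows. This is a simplified model for illustration, not the actual buffer-manager code: the struct fields and array here are invented, and the real implementation also handles locking and gives up with an error if every buffer is pinned.

```c
#define NBUFFERS 16

/* Simplified model of a buffer header: only the fields the sweep needs. */
typedef struct {
    int refcount;     /* pins held by backends; pinned buffers are skipped */
    int usage_count;  /* bumped on access, decayed by the sweep */
} FakeBufferDesc;

static FakeBufferDesc buffers[NBUFFERS];
static int next_victim = 0;   /* the "clock hand" */

/* Advance the clock hand until we find an unpinned buffer whose
 * usage_count has decayed to zero, decrementing counts as we pass.
 * Returns the victim buffer's index.  (Would loop forever if every
 * buffer were pinned; the real code errors out in that case.) */
int clock_sweep(void)
{
    for (;;)
    {
        FakeBufferDesc *buf = &buffers[next_victim];
        next_victim = (next_victim + 1) % NBUFFERS;

        if (buf->refcount == 0)
        {
            if (buf->usage_count == 0)
                return (int)(buf - buffers);  /* evict this one */
            buf->usage_count--;               /* give it another lap */
        }
    }
}
```

Note how this matches Greg's point: nothing here touches a free list; recently used buffers simply survive more laps of the hand.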
Re: [HACKERS] 2nd Level Buffer Cache
Merlin Moncure mmonc...@gmail.com Monday 21 March 2011 20:58:16 On Mon, Mar 21, 2011 at 2:08 PM, Greg Stark gsst...@mit.edu wrote: On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure mmonc...@gmail.com wrote: Can't you make just one large mapping and lock it in 8k regions? I thought the problem with mmap was not being able to detect other processes (http://www.mail-archive.com/pgsql-general@postgresql.org/msg122301.htm l) compatibility issues (possibly obsolete), etc. I was assuming that locking part of a mapping would force the kernel to split the mapping. It has to record the locked state somewhere so it needs a data structure that represents the size of the locked section and that would, I assume, be the mapping. It's possible the kernel would not in fact fall over too badly doing this. At some point I'll go ahead and do experiments on it. It's a bit fraught though as it the performance may depend on the memory management features of the chipset. That said, that's only part of the battle. On 32bit you can't map the whole database as your database could easily be larger than your address space. I have some ideas on how to tackle that but the simplest test would be to just mmap 8kB chunks everywhere. Even on 64 bit systems you only have 48 bit address space which is not a theoretical limitation. However, at least on linux you can map in and map out pretty quick (10 microseconds paired on my linux vm) so that's not so big of a deal. Dealing with rapidly growing files is a problem. That said, probably you are not going to want to reserve multiple gigabytes in 8k non contiguous chunks. But it's worse than that. Since you're not responsible for flushing blocks to disk any longer you need some way to *unlock* a block when it's possible to be flushed. That means when you flush the xlog you have to somehow find all the blocks that might no longer need to be locked and atomically unlock them. That would require new infrastructure we don't have though it might not be too hard. 
What would be nice is a mlock_until() where you eventually issue a call to tell the kernel what point in time you've reached and it unlocks everything older than that time. I wonder if there is any reason to mlock at all...if you are going to 'do' mmap, can't you just hide under current lock architecture for actual locking and do direct memory access without mlock? merlin Actually, after dealing with mmap and adding munmap, I found the crucial reason not to use mmap: you need to munmap, and for me this takes a lot of time, even if I only read with MAP_SHARED | PROT_READ. It looks like Linux does a flush or something else; the same happens with MAP_FIXED, MAP_PRIVATE, etc. Regards, Radek
Re: [HACKERS] 2nd Level Buffer Cache
Merlin Moncure mmonc...@gmail.com Monday 21 March 2011 20:58:16 On Mon, Mar 21, 2011 at 2:08 PM, Greg Stark gsst...@mit.edu wrote: On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure mmonc...@gmail.com wrote: Can't you make just one large mapping and lock it in 8k regions? I thought the problem with mmap was not being able to detect other processes (http://www.mail-archive.com/pgsql-general@postgresql.org/msg122301.htm l) compatibility issues (possibly obsolete), etc. I was assuming that locking part of a mapping would force the kernel to split the mapping. It has to record the locked state somewhere so it needs a data structure that represents the size of the locked section and that would, I assume, be the mapping. It's possible the kernel would not in fact fall over too badly doing this. At some point I'll go ahead and do experiments on it. It's a bit fraught though as it the performance may depend on the memory management features of the chipset. That said, that's only part of the battle. On 32bit you can't map the whole database as your database could easily be larger than your address space. I have some ideas on how to tackle that but the simplest test would be to just mmap 8kB chunks everywhere. Even on 64 bit systems you only have 48 bit address space which is not a theoretical limitation. However, at least on linux you can map in and map out pretty quick (10 microseconds paired on my linux vm) so that's not so big of a deal. Dealing with rapidly growing files is a problem. That said, probably you are not going to want to reserve multiple gigabytes in 8k non contiguous chunks. But it's worse than that. Since you're not responsible for flushing blocks to disk any longer you need some way to *unlock* a block when it's possible to be flushed. That means when you flush the xlog you have to somehow find all the blocks that might no longer need to be locked and atomically unlock them. That would require new infrastructure we don't have though it might not be too hard. 
What would be nice is a mlock_until() where you eventually issue a call to tell the kernel what point in time you've reached and it unlocks everything older than that time. Sorry to be curious, but I think mlock prevents swapping, not flushing. I wonder if there is any reason to mlock at all...if you are going to 'do' mmap, can't you just hide under current lock architecture for actual locking and do direct memory access without mlock? merlin The mmap man page does not say anything about when a flush occurs when the mapping is file-backed and shared, so flushes may happen whether intended or not. What's more, from what I've read, SysV shared memory is emulated by mmap (and I think that mmap is on /dev/shm). Radek
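Radek's reading matches the man pages: mlock() only pins pages in RAM against swapping, while for a MAP_SHARED file mapping the kernel writes dirty pages back at its own pace unless msync() is called. A sketch of pinning down the flush explicitly (my illustration, not part of any patch here; the file path in the usage example is hypothetical):

```c
/* Sketch: write through a MAP_SHARED mapping, then msync() to force the
 * dirty page to disk.  Without msync (or munmap/exit), the kernel may
 * write the page back whenever it likes, which is the uncertainty
 * discussed above. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Writes `len` bytes of `msg` at offset 0 of `path` via a shared mapping.
 * Returns 0 on success, -1 on failure.  Assumes the file already exists
 * and is at least `len` bytes long. */
int write_via_mapping(const char *path, const char *msg, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { close(fd); return -1; }

    memcpy(map, msg, len);

    /* MS_SYNC blocks until the write-back has completed. */
    int rc = msync(map, len, MS_SYNC);
    munmap(map, len);
    close(fd);
    return rc == 0 ? 0 : -1;
}
```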
Re: [HACKERS] 2nd Level Buffer Cache
Kevin Grittner kevin.gritt...@wicourts.gov Thursday 17 March 2011 22:02:18 Radosław Smogura rsmog...@softperience.eu wrote: I have implemented an initial concept of a 2nd-level cache. The idea is to keep some segments of shared memory for special buffers (e.g. indices) to prevent them from being overwritten by other operations. I added this functionality to the nbtree index scan. I tested this by doing an index scan, a seq read, dropping the system buffers, then doing the index scan again, and in a few places I saw performance improvements, but actually I'm not sure whether this was just random or a real improvement. I've often wondered about this. In a database I developed back in the '80s it was clearly a win to have a special cache for index entries and other special pages closer to the database than the general cache. A couple things have changed since the '80s (I mean, besides my waistline and hair color), and PostgreSQL has many differences from that other database, so I haven't been sure it would help as much, but I have wondered. I can't really look at this for a couple weeks, but I'm definitely interested. I suggest that you add this to the next CommitFest as a WIP patch, under the Performance category. https://commitfest.postgresql.org/action/commitfest_view/open There are a few places to optimize the code as well, and the patch needs a lot of work, but could you look at it and give your opinions? For something like this it makes perfect sense to show proof of concept before trying to cover everything. -Kevin Here I attach the latest version of the patch with a few performance improvements (the code is still dirty), some reports from the tests, and my simple test scripts. There is a small improvement without dropping the system caches, and a bigger one with dropping them. I see a small performance decrease (if we can talk about measuring based on these tests) versus the original PG version with the same configuration, but an increase with 2nd-level buffers... or maybe I compared the reports badly. In the tests I tried to choose typical, simple queries.
Regards, Radek 2nd_lvl_cache_20110318.diff.bz2 Description: application/bzip test-scritps_20110319_0026.tar.bz2 Description: application/bzip-compressed-tar reports_20110318.tar.bz2 Description: application/bzip-compressed-tar
[HACKERS] 2nd Level Buffer Cache
Hi, I have implemented initial concept of 2nd level cache. Idea is to keep some segments of shared memory for special buffers (e.g. indices) to prevent overwrite those by other operations. I added those functionality to nbtree index scan. I tested this with doing index scan, seq read, drop system buffers, do index scan and in few places I saw performance improvements, but actually, I'm not sure if this was just random or intended improvement. There is few places to optimize code as well, and patch need many work, but may you see it and give opinions? Regards, Radek diff --git a/.gitignore b/.gitignore index 3f11f2e..6542e35 100644 --- a/.gitignore +++ b/.gitignore @@ -22,3 +22,4 @@ lcov.info /GNUmakefile /config.log /config.status +/nbproject/private/ \ No newline at end of file diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c index 2796445..0229f5a 100644 --- a/src/backend/access/nbtree/nbtpage.c +++ b/src/backend/access/nbtree/nbtpage.c @@ -508,7 +508,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access) if (blkno != P_NEW) { /* Read an existing block of the relation */ - buf = ReadBuffer(rel, blkno); + buf = ReadBufferLevel(rel, blkno, BUFFER_LEVEL_2ND); LockBuffer(buf, access); _bt_checkpage(rel, buf); } @@ -548,7 +548,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access) blkno = GetFreeIndexPage(rel); if (blkno == InvalidBlockNumber) break; - buf = ReadBuffer(rel, blkno); + buf = ReadBufferLevel(rel, blkno, BUFFER_LEVEL_2ND); if (ConditionalLockBuffer(buf)) { page = BufferGetPage(buf); diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c index dadb49d..2922711 100644 --- a/src/backend/storage/buffer/buf_init.c +++ b/src/backend/storage/buffer/buf_init.c @@ -22,6 +22,7 @@ BufferDesc *BufferDescriptors; char *BufferBlocks; int32 *PrivateRefCount; +BufferLevelDesc *bufferLevels; /* * Data Structures: @@ -72,8 +73,7 @@ int32 *PrivateRefCount; void InitBufferPool(void) { - 
bool foundBufs, -foundDescs; + bool foundBufs, foundDescs, foundBufferLevels = false; BufferDescriptors = (BufferDesc *) ShmemInitStruct(Buffer Descriptors, @@ -83,19 +83,38 @@ InitBufferPool(void) ShmemInitStruct(Buffer Blocks, NBuffers * (Size) BLCKSZ, foundBufs); - if (foundDescs || foundBufs) +bufferLevels = (BufferLevelDesc*) +ShmemInitStruct(Buffer Levels Descriptors Table, + sizeof(BufferLevelDesc) * BUFFER_LEVEL_SIZE, +foundBufferLevels); + if (foundDescs || foundBufs || foundBufferLevels) { /* both should be present or neither */ - Assert(foundDescs foundBufs); + Assert(foundDescs foundBufs foundBufferLevels); /* note: this path is only taken in EXEC_BACKEND case */ } else { BufferDesc *buf; +BufferLevelDesc *bufferLevelDesc; + int i; - + buf = BufferDescriptors; +/* Initialize buffer levels. */ +//1st Level - Default +bufferLevelDesc = bufferLevels; +bufferLevelDesc-index = 0; +bufferLevelDesc-super = BUFFER_LEVEL_END_OF_LIST; +bufferLevelDesc-lower = BUFFER_LEVEL_END_OF_LIST; + +//2nd Level - For indices +bufferLevelDesc++; +bufferLevelDesc-index = 1; +bufferLevelDesc-super = BUFFER_LEVEL_END_OF_LIST; +bufferLevelDesc-lower = 0; + /* * Initialize all the buffer headers. */ @@ -117,6 +136,10 @@ InitBufferPool(void) */ buf-freeNext = i + 1; +/* Assign buffer level. */ +//TODO Currently hardcoded - +buf-buf_level = ( 0.3 * NBuffers i ) ? BUFFER_LEVEL_DEFAULT : BUFFER_LEVEL_2ND; + buf-io_in_progress_lock = LWLockAssign(); buf-content_lock = LWLockAssign(); } diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 1f89e52..867bae0 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -47,7 +47,8 @@ #include storage/standby.h #include utils/rel.h #include utils/resowner.h - +#include catalog/pg_type.h +#include funcapi.h /* Note: these two macros only work on shared buffers, not local ones! 
*/ #define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)-buf_id) * BLCKSZ)) @@ -85,7 +86,7 @@ static volatile BufferDesc *PinCountWaitBuf = NULL; static Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence, ForkNumber forkNum, BlockNumber blockNum, ReadBufferMode mode, BufferAccessStrategy strategy, - bool *hit); + bool *hit, BufferLevel bufferLevel); static bool PinBuffer(volatile BufferDesc *buf,
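The partitioning rule hard-coded in the hunk above lost its comparison operator in the archive; read standalone, and assuming `>` was the stripped character (so the first 30% of the pool stays at the default level and the rest is reserved as the 2nd level for index pages), it would be:

```c
/* Sketch of the patch's hard-coded buffer partitioning.  The `>` below
 * is my reconstruction; the archived diff lost the comparison operator.
 * Under this reading, buffers 0 .. 0.3*nbuffers-1 are 1st (default)
 * level and the remainder are reserved for the 2nd (index) level. */
enum { BUFFER_LEVEL_DEFAULT = 0, BUFFER_LEVEL_2ND = 1 };

int buffer_level_for(int i, int nbuffers)
{
    return (0.3 * nbuffers > i) ? BUFFER_LEVEL_DEFAULT : BUFFER_LEVEL_2ND;
}
```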
Re: [HACKERS] Binary in/out for aclitem
Tom Lane t...@sss.pgh.pa.us Wednesday 23 February 2011 22:30:04 =?utf-8?q?Rados=C5=82aw_Smogura?= rsmog...@softperience.eu writes: Here is extended version, has version field (N_ACL_RIGHTS*2) and reserved mask, as well definition is more general then def of PGSQL. In any way it require that rights mades bit array. You're going in quite the wrong direction here. The consensus as I understood it was that we should just use the text representation in binary mode too, rather than inventing a separate representation that's going to put a whole new set of constraints on what can happen to the internal representation. The proposal you have here has no redeeming social value whatever, because nobody cares about the I/O efficiency for aclitem (and even if anyone did, you've made no case that this would actually be more efficient to use on the client side). regards, tom lane Look at it. Pass call to in/out. Regards, Radek diff --git a/.gitignore b/.gitignore index 1be11e8..0d594f9 100644 --- a/.gitignore +++ b/.gitignore @@ -17,3 +17,5 @@ objfiles.txt /GNUmakefile /config.log /config.status +/nbproject/private/ +/nbproject diff --git a/src/backend/utils/adt/acl.c b/src/backend/utils/adt/acl.c index 691ba3b..fa151cd 100644 --- a/src/backend/utils/adt/acl.c +++ b/src/backend/utils/adt/acl.c @@ -33,6 +33,7 @@ #include utils/lsyscache.h #include utils/memutils.h #include utils/syscache.h +#include libpq/pqformat.h typedef struct @@ -77,7 +78,8 @@ static const char *getid(const char *s, char *n); static void putid(char *p, const char *s); static Acl *allocacl(int n); static void check_acl(const Acl *acl); -static const char *aclparse(const char *s, AclItem *aip); +static const char *aclparse(const char *s, AclItem *aip, bool binary); +static Datum aclitem_common_in_recv(const char* s, bool binary); static bool aclitem_match(const AclItem *a1, const AclItem *a2); static int aclitemComparator(const void *arg1, const void *arg2); static void check_circularity(const Acl *old_acl, 
const AclItem *mod_aip, @@ -222,6 +224,8 @@ putid(char *p, const char *s) * * This routine is called by the parser as well as aclitemin(), hence * the added generality. + * + * @param binary if we parse for binary mode or text mode * * RETURNS: * the string position in 's' immediately following the ACL @@ -230,7 +234,7 @@ putid(char *p, const char *s) * UID/GID, id type identifier and mode type values. */ static const char * -aclparse(const char *s, AclItem *aip) +aclparse(const char *s, AclItem *aip, bool binary) { AclMode privs, goption, @@ -249,20 +253,20 @@ aclparse(const char *s, AclItem *aip) /* we just read a keyword, not a name */ if (strcmp(name, group) != 0 strcmp(name, user) != 0) ereport(ERROR, - (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), + (errcode((binary ? ERRCODE_INVALID_BINARY_REPRESENTATION : ERRCODE_INVALID_TEXT_REPRESENTATION)), errmsg(unrecognized key word: \%s\, name), errhint(ACL key word must be \group\ or \user\.))); s = getid(s, name); /* move s to the name beyond the keyword */ if (name[0] == '\0') ereport(ERROR, - (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), + (errcode((binary ? ERRCODE_INVALID_BINARY_REPRESENTATION : ERRCODE_INVALID_TEXT_REPRESENTATION)), errmsg(missing name), errhint(A name must follow the \group\ or \user\ key word.))); } if (*s != '=') ereport(ERROR, -(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), +(errcode((binary ? ERRCODE_INVALID_BINARY_REPRESENTATION : ERRCODE_INVALID_TEXT_REPRESENTATION)), errmsg(missing \=\ sign))); privs = goption = ACL_NO_RIGHTS; @@ -315,7 +319,7 @@ aclparse(const char *s, AclItem *aip) break; default: ereport(ERROR, - (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), + (errcode((binary ? 
ERRCODE_INVALID_BINARY_REPRESENTATION : ERRCODE_INVALID_TEXT_REPRESENTATION)), errmsg(invalid mode character: must be one of \%s\, ACL_ALL_RIGHTS_STR))); } @@ -337,7 +341,7 @@ aclparse(const char *s, AclItem *aip) s = getid(s + 1, name2); if (name2[0] == '\0') ereport(ERROR, - (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), + (errcode((binary ? ERRCODE_INVALID_BINARY_REPRESENTATION : ERRCODE_INVALID_TEXT_REPRESENTATION)), errmsg(a name must follow the \/\ sign))); aip-ai_grantor = get_role_oid(name2, false); } @@ -548,6 +552,22 @@ check_acl(const Acl *acl) errmsg(ACL arrays must not contain null values))); } +static +Datum aclitem_common_in_recv(const char* s, bool binary) +{ +AclItem*aip; + +aip = (AclItem *) palloc(sizeof(AclItem)); +s = aclparse(s, aip, binary); +while (isspace((unsigned char) *s)) +++s; +if (*s) +ereport(ERROR, +(errcode((binary ? ERRCODE_INVALID_BINARY_REPRESENTATION : ERRCODE_INVALID_TEXT_REPRESENTATION)), +
Re: [HACKERS] Binary in/out for aclitem
Tom Lane t...@sss.pgh.pa.us Wednesday 23 February 2011 16:19:27 rsmogura rsmog...@softperience.eu writes: On Tue, 22 Feb 2011 20:20:39 -0500, Tom Lane wrote: ... But my question isn't about that; it's about why aclitem should be considered a first-class citizen. It makes me uncomfortable that client apps are looking at it at all, because any that do are bound to get broken in the future, even assuming that they get the right answers today. I wonder how many such clients are up to speed for per-column privileges and non-constant default privileges for instance. And sepgsql is going to cut them off at the knees. Technically, at a glance, I didn't see any modifications to acl.h in sepgsql. So I think aclitem will be unaffected. In any case, sepgsql needs some way to present access rights to the administrator; it may use its own model, or aclitem, too. You're missing the point, which is that the current internal representation of aclitem could change drastically to support future feature improvements in the area of privileges. It has already changed significantly in the past (we didn't use to have WITH GRANT OPTION). If we had to add a field, for instance, a binary representation would simply be broken, as clients would have difficulty telling how to interpret it as soon as there was more than one possible format. Text representations are typically a bit more extensible. regards, tom lane I removed this from the patch (I think it's currently not needed, but it would be enough to put it in the doc): Each privilege has an individual number P from 1 to n, and is represented by the set P-th bit. The first n bits (in network bit order) represent the normal privileges, the next n bits represent the grant options of those privileges. This chain is encoded as an n*2-bit number rounded up to full 8 bits, with a minimal length of 32 bits.
I was thinking about adding the total number of privileges to each ACL item, but I removed that since it can be deduced from the PG version; the first 7 bits represent the version, and the last, 8th bit represents whether the grant part has been added. --- In any case, binary output should be available if we have a binary mode. I know that text is more extensible; in contrast to the packed version above, we could describe ACL privileges as byte-array elements, one per set privilege (same as text). A fallback solution is to just call aclin/aclout with a StringInfo. Regards, Radek.
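The packed encoding described in this exchange (privilege P sets one bit; n ordinary-privilege bits followed by n grant-option bits; the whole chain rounded up to a byte boundary with a 32-bit floor) can be sketched as below. This is my reading of the proposal, not committed code, and the choice of putting the ordinary privileges in the low bits is illustrative; the post says "network bit order" without fully pinning that down.

```c
/* Sketch of the proposed wire encoding for aclitem privileges:
 * low n bits = ordinary privileges, next n bits = grant options. */
#include <stdint.h>

/* Pack the two n-bit masks into a single value (assumes 1 <= n < 32). */
uint64_t pack_acl_bits(uint32_t privs, uint32_t goptions, int n)
{
    uint64_t mask = (UINT64_C(1) << n) - 1;
    return ((uint64_t)(goptions & mask) << n) | (privs & mask);
}

/* Width of the encoded chain in bits: 2*n rounded up to a multiple
 * of 8, with a minimum of 32, as described in the thread. */
int packed_width_bits(int n)
{
    int bits = ((2 * n + 7) / 8) * 8;
    return bits < 32 ? 32 : bits;
}
```

This also illustrates Tom's objection: once a field like the grant-option block is baked into fixed bit positions, any client that decodes it must track every future change to n and to the layout.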
Re: [HACKERS] Binary in/out for aclitem
Tom Lane t...@sss.pgh.pa.us Wednesday 23 February 2011 16:19:27 rsmogura rsmog...@softperience.eu writes: On Tue, 22 Feb 2011 20:20:39 -0500, Tom Lane wrote: ... But my question isn't about that; it's about why aclitem should be considered a first-class citizen. It makes me uncomfortable that client apps are looking at it at all, because any that do are bound to get broken in the future, even assuming that they get the right answers today. I wonder how many such clients are up to speed for per-column privileges and non-constant default privileges for instance. And sepgsql is going to cut them off at the knees. Technically, at a glance, I didn't see any modifications to acl.h in sepgsql. So I think aclitem will be unaffected. In any case, sepgsql needs some way to present access rights to the administrator; it may use its own model, or aclitem, too. You're missing the point, which is that the current internal representation of aclitem could change drastically to support future feature improvements in the area of privileges. It has already changed significantly in the past (we didn't use to have WITH GRANT OPTION). If we had to add a field, for instance, a binary representation would simply be broken, as clients would have difficulty telling how to interpret it as soon as there was more than one possible format. Text representations are typically a bit more extensible. regards, tom lane Actually, you've muddled my head a little. So, regarding my previous post: we don't need to send information about whether the grant option has been set; currently in text mode no grant options means ACL_NO_RIGHTS, and in binary the same may be achieved by setting 0 there. But a version field may be useful to validate this and future calls, and to provide backward compatibility (if a newer client sends fewer bits, the rest of the bits will be set to 0). I'm thinking about splitting the privilege chain into two numbers; that may make it easier to implement and parse if the number of privileges ever exceeds 32...
In addition, I may add support for a possible future representation where a given privilege may be yes, no, or undefined (like in Windows). Regards, Radek
Re: [HACKERS] Binary in/out for aclitem
Tom Lane t...@sss.pgh.pa.us Wednesday 23 February 2011 16:19:27 rsmogura rsmog...@softperience.eu writes: On Tue, 22 Feb 2011 20:20:39 -0500, Tom Lane wrote: ... But my question isn't about that; it's about why aclitem should be considered a first-class citizen. It makes me uncomfortable that client apps are looking at it at all, because any that do are bound to get broken in the future, even assuming that they get the right answers today. I wonder how many such clients are up to speed for per-column privileges and non-constant default privileges for instance. And sepgsql is going to cut them off at the knees. Technically, at a glance, I didn't see any modifications to acl.h in sepgsql. So I think aclitem will be unaffected. In any case, sepgsql needs some way to present access rights to the administrator; it may use its own model, or aclitem, too. You're missing the point, which is that the current internal representation of aclitem could change drastically to support future feature improvements in the area of privileges. It has already changed significantly in the past (we didn't use to have WITH GRANT OPTION). If we had to add a field, for instance, a binary representation would simply be broken, as clients would have difficulty telling how to interpret it as soon as there was more than one possible format. Text representations are typically a bit more extensible. regards, tom lane Here is an extended version; it has a version field (N_ACL_RIGHTS*2) and a reserved mask, and the definition is more general than PGSQL's. In any case it requires that the rights form a bit array. Still, I have tested only aclitemsend. Btw, is it possible and needed to add a group byte, indicating whether the grantee is a group or a user?
Regards, Radek diff --git a/.gitignore b/.gitignore index 1be11e8..0d594f9 100644 --- a/.gitignore +++ b/.gitignore @@ -17,3 +17,5 @@ objfiles.txt /GNUmakefile /config.log /config.status +/nbproject/private/ +/nbproject diff --git a/src/backend/utils/adt/acl.c b/src/backend/utils/adt/acl.c index 691ba3b..c25c0fd 100644 --- a/src/backend/utils/adt/acl.c +++ b/src/backend/utils/adt/acl.c @@ -33,6 +33,7 @@ #include utils/lsyscache.h #include utils/memutils.h #include utils/syscache.h +#include libpq/pqformat.h typedef struct @@ -78,6 +79,10 @@ static void putid(char *p, const char *s); static Acl *allocacl(int n); static void check_acl(const Acl *acl); static const char *aclparse(const char *s, AclItem *aip); + +/** Assigns default grantor and send warning. */ +static void aclitem_assign_default_grantor(AclItem *aip); + static bool aclitem_match(const AclItem *a1, const AclItem *a2); static int aclitemComparator(const void *arg1, const void *arg2); static void check_circularity(const Acl *old_acl, const AclItem *mod_aip, @@ -209,6 +214,14 @@ putid(char *p, const char *s) *p = '\0'; } +/** Assigns default grantor and send warning. */ +void aclitem_assign_default_grantor(AclItem *aip) { +aip-ai_grantor = BOOTSTRAP_SUPERUSERID; +ereport(WARNING, +(errcode(ERRCODE_INVALID_GRANTOR), + errmsg(defaulting grantor to user ID %u, +BOOTSTRAP_SUPERUSERID))); +} /* * aclparse * Consumes and parses an ACL specification of the form: @@ -343,11 +356,7 @@ aclparse(const char *s, AclItem *aip) } else { - aip-ai_grantor = BOOTSTRAP_SUPERUSERID; - ereport(WARNING, -(errcode(ERRCODE_INVALID_GRANTOR), - errmsg(defaulting grantor to user ID %u, - BOOTSTRAP_SUPERUSERID))); +aclitem_assign_default_grantor(aip); } ACLITEM_SET_PRIVS_GOPTIONS(*aip, privs, goption); @@ -643,6 +652,163 @@ aclitemout(PG_FUNCTION_ARGS) PG_RETURN_CSTRING(out); } +/** Do binary read of aclitem. Input format is same as {@link aclitem_recv}, but + * special algorithm is used to determine grantee's and grantor's OID. 
The reason + * is to keep backward information compatiblity with text mode - typical + * client (which gets instructions from user) + * may be much more interested in sending grantee and grantors name then + * OID. Detailed rule is as follow:br/ + * If message has no name and names' length then + * use passed OIDs (message may be truncated, we accept this, + * but both, two last fields must be not present).br/ + * If grantee's name len or grantor's name len is {@code -1} then use respecitve + * OIDs.br/ + * If name length is not {@code -1} then find OID for given part, and + * ensure that respective OID is {@code 0} or is equal to found OID./br + * If grantor's OID is {@code 0} and grantor's name lenght is {@code -1} or + * truncated then assign superuser as grantor. + */ +Datum +aclitemrecv(PG_FUNCTION_ARGS) { +StringInfo buf = (StringInfo) PG_GETARG_POINTER(0); +AclItem*aip; +intgRawLen; +char *gVal = NULL; +int4gValLen; +OidgOid; +int2 numberOfAcls; +int4 mask;
[HACKERS] Binary in/out for aclitem
Hi, Actually one more POD is left: aclitem :). In Java, for example, it is used to obtain column privileges, so I assume some folks may want to use it too. I tested only recv :-( Actually I don't know if the idea of such a format is OK, but my intention was to send role names, so the driver doesn't need to ask for the name (like for regproc), and to keep the human-based approach (text) and the binary approach together. Kind regards, Radek

diff --git a/.gitignore b/.gitignore
index 1be11e8..0d594f9 100644
--- a/.gitignore
+++ b/.gitignore
@@ -17,3 +17,5 @@ objfiles.txt
 /GNUmakefile
 /config.log
 /config.status
+/nbproject/private/
+/nbproject
diff --git a/src/backend/utils/adt/acl.c b/src/backend/utils/adt/acl.c
index 691ba3b..33877cb 100644
--- a/src/backend/utils/adt/acl.c
+++ b/src/backend/utils/adt/acl.c
@@ -33,6 +33,7 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/syscache.h"
+#include "libpq/pqformat.h"

 typedef struct
@@ -78,6 +79,10 @@ static void putid(char *p, const char *s);
 static Acl *allocacl(int n);
 static void check_acl(const Acl *acl);
 static const char *aclparse(const char *s, AclItem *aip);
+
+/** Assigns the default grantor and sends a warning. */
+static void aclitem_assign_default_grantor(AclItem *aip);
+
 static bool aclitem_match(const AclItem *a1, const AclItem *a2);
 static int aclitemComparator(const void *arg1, const void *arg2);
 static void check_circularity(const Acl *old_acl, const AclItem *mod_aip,
@@ -209,6 +214,14 @@ putid(char *p, const char *s)
 	*p = '\0';
 }

+/** Assigns the default grantor and sends a warning. */
+static void
+aclitem_assign_default_grantor(AclItem *aip)
+{
+	aip->ai_grantor = BOOTSTRAP_SUPERUSERID;
+	ereport(WARNING,
+			(errcode(ERRCODE_INVALID_GRANTOR),
+			 errmsg("defaulting grantor to user ID %u",
+					BOOTSTRAP_SUPERUSERID)));
+}
 /*
  * aclparse
  *		Consumes and parses an ACL specification of the form:
@@ -343,11 +356,7 @@ aclparse(const char *s, AclItem *aip)
 	}
 	else
 	{
-		aip->ai_grantor = BOOTSTRAP_SUPERUSERID;
-		ereport(WARNING,
-				(errcode(ERRCODE_INVALID_GRANTOR),
-				 errmsg("defaulting grantor to user ID %u",
-						BOOTSTRAP_SUPERUSERID)));
+		aclitem_assign_default_grantor(aip);
 	}

 	ACLITEM_SET_PRIVS_GOPTIONS(*aip, privs, goption);
@@ -643,6 +652,143 @@ aclitemout(PG_FUNCTION_ARGS)
 	PG_RETURN_CSTRING(out);
 }

+/** Does a binary read of an aclitem. The input format is the same as {@link aclitem_recv}, but
+ * a special algorithm is used to determine the grantee's and grantor's OID. The reason
+ * is to keep backward information compatibility with text mode - a typical
+ * client (which gets instructions from the user)
+ * may be much more interested in sending the grantee's and grantor's names than
+ * OIDs. The detailed rule is as follows:<br/>
+ * If the message contains <b>only</b> OIDs and privs (no names and name lengths) then
+ * use the passed OIDs (the message may be truncated, we accept this,
+ * but both of the two last fields must be absent).<br/>
+ * If the grantee's name length or grantor's name length is {@code -1} then use the respective
+ * OIDs.<br/>
+ * If a name length is not {@code -1} then find the OID for the given part, and
+ * ensure that the respective OID is {@code 0} or equal to the found OID.<br/>
+ * If the grantor's OID is {@code 0} and the grantor's name length is {@code -1} or
+ * truncated then assign the superuser as grantor.
+ */
+Datum
+aclitemrecv(PG_FUNCTION_ARGS)
+{
+	StringInfo	buf = (StringInfo) PG_GETARG_POINTER(0);
+	AclItem	   *aip;
+	int			gRawLen;
+	char	   *gVal = NULL;
+	int			gValLen;
+	Oid			gOid;
+
+	aip = (AclItem *) palloc(sizeof(AclItem));
+	aip->ai_grantee = pq_getmsgint(buf, 4);
+	aip->ai_grantor = pq_getmsgint(buf, 4);
+	aip->ai_privs = pq_getmsgint(buf, 4);
+
+	/* Client has passed names. */
+	if (buf->cursor < buf->len)
+	{
+		/* Read grantee. */
+		gRawLen = pq_getmsgint(buf, 4);
+		if (gRawLen != -1)
+		{
+			gVal = pq_getmsgtext(buf, gRawLen, &gValLen);
+			gOid = get_role_oid(gVal, false);
+			if (aip->ai_grantee != 0 && aip->ai_grantee != gOid)
+			{
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_BINARY_REPRESENTATION),
+						 errmsg("grantee's OID is not 0 and the passed grantee's name and grantee's OID don't match")));
+			}
+		}
+
+		/* Read grantor. */
+		gRawLen = pq_getmsgint(buf, 4);
+		if (gRawLen != -1)
+		{
+			gVal = pq_getmsgtext(buf, gRawLen, &gValLen);
+			gOid = get_role_oid(gVal, false);
+			if (aip->ai_grantor != 0 && aip->ai_grantor != gOid)
+			{
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_BINARY_REPRESENTATION),
+						 errmsg("grantor's OID is not 0 and the passed grantor's name and grantor's OID don't match")));
+			}
+		}
+		else
+		{
+			gVal = NULL;
+		}
+	}
+
+	if
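For the client side, here is a minimal Java sketch of how the extended aclitem wire format described above could be encoded: three 4-byte integers (grantee OID, grantor OID, privs) followed by two optional length-prefixed role names, where a length of -1 means "use the OID field instead". The class and method names are hypothetical, not taken from the driver:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class AclItemEncoder {

    /** Encodes the extended aclitem binary format sketched in the patch:
     *  grantee OID, grantor OID, privs, then two optional length-prefixed
     *  role names. Pass null for a name to emit the -1 marker. */
    static byte[] encode(int granteeOid, int grantorOid, int privs,
                         String granteeName, String grantorName) throws IOException {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(raw);
        out.writeInt(granteeOid);
        out.writeInt(grantorOid);
        out.writeInt(privs);
        writeName(out, granteeName);
        writeName(out, grantorName);
        return raw.toByteArray();
    }

    private static void writeName(DataOutputStream out, String name) throws IOException {
        if (name == null) {
            out.writeInt(-1);   // tell the server to trust the OID field instead
        } else {
            byte[] b = name.getBytes(StandardCharsets.UTF_8);
            out.writeInt(b.length);
            out.write(b);
        }
    }

    public static void main(String[] args) throws IOException {
        // 12 header bytes + (4 + 5) for "alice" + 4 for the -1 marker = 25
        byte[] msg = encode(0, 10, 0x0F, "alice", null);
        System.out.println(msg.length); // 25
    }
}
```

Per the rules in the comment above, a client could also truncate the message right after the privs field (the third int) and the server would fall back to the OIDs.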
[HACKERS] Void binary patch
Just a patch for the missing procedures for void send/recv. Regards, Radek

diff --git a/src/backend/utils/adt/pseudotypes.c b/src/backend/utils/adt/pseudotypes.c
index d9329f8..614eb98 100644
--- a/src/backend/utils/adt/pseudotypes.c
+++ b/src/backend/utils/adt/pseudotypes.c
@@ -212,7 +212,20 @@ void_out(PG_FUNCTION_ARGS)
 	PG_RETURN_CSTRING(pstrdup(""));
 }

+Datum
+void_recv(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_VOID();
+}
+
+Datum
+void_send(PG_FUNCTION_ARGS)
+{
+	StringInfoData buf;
+
+	pq_begintypsend(&buf);		/* nice machinery to send nothing */
+	PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+}
 /*
  * trigger_in		- input routine for pseudo-type TRIGGER.
  */
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 0894985..0711474 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -4226,6 +4226,11 @@ DATA(insert OID = 2502 ( anyarray_recv PGNSP PGUID 12 1 0 0 f f f t f s 1 0
 DESCR("I/O");
 DATA(insert OID = 2503 ( anyarray_send PGNSP PGUID 12 1 0 0 f f f t f s 1 0 17 "2277" _null_ _null_ _null_ _null_ anyarray_send _null_ _null_ _null_ ));
 DESCR("I/O");
+DATA(insert OID = 3120 ( void_recv PGNSP PGUID 12 1 0 0 f f f t f s 1 0 2278 "2281" _null_ _null_ _null_ _null_ void_recv _null_ _null_ _null_ ));
+DESCR("I/O");
+DATA(insert OID = 3121 ( void_send PGNSP PGUID 12 1 0 0 f f f t f s 1 0 17 "2278" _null_ _null_ _null_ _null_ void_send _null_ _null_ _null_ ));
+DESCR("I/O");
+
 /* System-view support functions with pretty-print option */
 DATA(insert OID = 2504 ( pg_get_ruledef PGNSP PGUID 12 1 0 0 f f f t f s 2 0 25 "26 16" _null_ _null_ _null_ _null_ pg_get_ruledef_ext _null_ _null_ _null_ ));
diff --git a/src/include/catalog/pg_type.h b/src/include/catalog/pg_type.h
index 0f7312e..9baed6c 100644
--- a/src/include/catalog/pg_type.h
+++ b/src/include/catalog/pg_type.h
@@ -615,7 +615,7 @@ DATA(insert OID = 2276 ( any PGNSP PGUID 4 t p P f t \054 0 0 0 any_in any_ou
 #define ANYOID 2276
 DATA(insert OID = 2277 ( anyarray PGNSP PGUID -1 f p P f t \054 0 0 0 anyarray_in anyarray_out anyarray_recv anyarray_send - - - d x f 0 -1 0 0 _null_ _null_ ));
 #define ANYARRAYOID 2277
-DATA(insert OID = 2278 ( void PGNSP PGUID 4 t p P f t \054 0 0 0 void_in void_out - - - - - i p f 0 -1 0 0 _null_ _null_ ));
+DATA(insert OID = 2278 ( void PGNSP PGUID 4 t p P f t \054 0 0 0 void_in void_out void_recv void_send - - - i p f 0 -1 0 0 _null_ _null_ ));
 #define VOIDOID 2278
 DATA(insert OID = 2279 ( trigger PGNSP PGUID 4 t p P f t \054 0 0 0 trigger_in trigger_out - - - - - i p f 0 -1 0 0 _null_ _null_ ));
 #define TRIGGEROID 2279
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 8392be6..8652ba0 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -504,6 +504,8 @@ extern Datum anyenum_in(PG_FUNCTION_ARGS);
 extern Datum anyenum_out(PG_FUNCTION_ARGS);
 extern Datum void_in(PG_FUNCTION_ARGS);
 extern Datum void_out(PG_FUNCTION_ARGS);
+extern Datum void_recv(PG_FUNCTION_ARGS);
+extern Datum void_send(PG_FUNCTION_ARGS);
 extern Datum trigger_in(PG_FUNCTION_ARGS);
 extern Datum trigger_out(PG_FUNCTION_ARGS);
 extern Datum language_handler_in(PG_FUNCTION_ARGS);
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Varchar and binary protocol
Actually, the difference is http://archives.postgresql.org/pgsql-hackers/2011-02/msg00415.php Merlin Moncure mmonc...@gmail.com Thursday 10 February 2011 08:48:26 On Sat, Feb 5, 2011 at 4:59 PM, Radosław Smogura rsmog...@softperience.eu wrote: Hi, I ran performance tests against the original JDBC driver and my version, in binary and in text mode. I saw strange results when I was reading varchar values. Here is some output from a simple benchmark: Plain strings speed Execution: 8316582, local: 2116608, all: 10433190 Binary strings speed Execution: 9354613, local: 2755949, all: 12110562 Text NG strings speed Execution: 8346902, local: 2704242, all: 11051144 Plain is the standard JDBC driver, Binary is my version with binary transfer, Text is my version with normal transfer. The 1st column, Execution, is the time spent on query execution; this includes sending, receiving the proto message, storing it, etc., with no conversion to the output format. Values are in nanoseconds. In the new version I added some functionality, but the routines that read the parts in the Execution block are almost the same for binary and text. But as you see, the binary version is 10-20% slower than the original and my text version, and if I increase the number of read records this proportion does not change. I did many checks, even against a driver that skips the proto message content; the end results were the same, 10-20% slower. Since there is basically zero difference in how *varchar* is handled in the database for the text or binary protocols (AFAIK, they use the same code), this is almost certainly an issue with the JDBC driver, or your benchmark application. merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Varlena and binary
Just out of curiosity, may I ask in which direction this will go, and how it will affect the performance of the text and binary formats? Actually I started to make smaller improvements, and I'm thinking about one big one: to encode text (when client and server encodings are different) directly into the StringInfo, without an intermediate buffer. Thanks in advance, Radek Tom Lane t...@sss.pgh.pa.us Monday 07 February 2011 17:12:07 =?utf-8?q?Rados=C5=82aw_Smogura?= m...@smogura.eu writes: I'm sending a small patch for textsend. It reduces unnecessary copies, and memory usage for duplication of varlena data. May you look? This code will break the day that text and bytea don't have the same internal representation, which seems likely to be soon. Barring some compelling evidence of a major performance improvement obtainable this way, I don't think we want this patch. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Varlena and binary
Hi, I'm sending a small patch for textsend. It reduces unnecessary copies, and memory usage for duplication of varlena data. May you look? Kind regards, Radosław Smogura

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index e111d26..f24bbcd 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -442,12 +442,20 @@ textrecv(PG_FUNCTION_ARGS)
 Datum
 textsend(PG_FUNCTION_ARGS)
 {
-	text	   *t = PG_GETARG_TEXT_PP(0);
-	StringInfoData buf;
-
-	pq_begintypsend(&buf);
-	pq_sendtext(&buf, VARDATA_ANY(t), VARSIZE_ANY_EXHDR(t));
-	PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+	text	   *t = PG_GETARG_TEXT_PP(0);
+	const char *textData = VARDATA_ANY(t);
+	const int	textSize = VARSIZE_ANY_EXHDR(t);
+	char	   *textConverted = pg_server_to_client(textData, textSize);
+
+	/* Logic based on pq_sendtext */
+	if (textConverted == textData)
+	{
+		PG_RETURN_BYTEA_P(t);
+	}
+	else
+	{
+		StringInfoData buf;
+
+		pq_begintypsend(&buf);
+		appendBinaryStringInfo(&buf, textConverted, strlen(textConverted));
+		pfree(textConverted);
+		PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
+	}
 }
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Varchar and binary protocol
Hi, I ran performance tests against the original JDBC driver and my version, in binary and in text mode. I saw strange results when I was reading varchar values. Here is some output from a simple benchmark: Plain strings speed Execution: 8316582, local: 2116608, all: 10433190 Binary strings speed Execution: 9354613, local: 2755949, all: 12110562 Text NG strings speed Execution: 8346902, local: 2704242, all: 11051144 Plain is the standard JDBC driver, Binary is my version with binary transfer, Text is my version with normal transfer. The 1st column, Execution, is the time spent on query execution; this includes sending, receiving the proto message, storing it, etc., with no conversion to the output format. Values are in nanoseconds. In the new version I added some functionality, but the routines that read the parts in the Execution block are almost the same for binary and text. But as you see, the binary version is 10-20% slower than the original and my text version, and if I increase the number of read records this proportion does not change. I did many checks, even against a driver that skips the proto message content; the end results were the same, 10-20% slower. Regards, Radek. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
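A 10-20% gap like the one reported above can come from JVM effects rather than the protocol itself; in particular, iterations run before JIT compilation kicks in will skew nanosecond timings. A minimal timing harness that discards a warm-up phase is sketched below; the names are illustrative and this is not the benchmark actually used:

```java
public class Bench {

    /** Times repeated executions of a task in nanoseconds, discarding a
     *  warm-up phase first so JIT compilation does not skew the comparison
     *  between, e.g., text-mode and binary-mode read paths. */
    static long timeNanos(Runnable task, int warmup, int iterations) {
        for (int i = 0; i < warmup; i++)
            task.run();                 // warm-up: let the JIT compile the hot path
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++)
            task.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Stand-in workload; in a real comparison this would be the driver's
        // row-reading routine for each protocol mode.
        long t = timeNanos(() -> { String s = "x".repeat(100); }, 1000, 10000);
        System.out.println(t > 0);
    }
}
```

Comparing both modes inside the same JVM run, after identical warm-up, makes a consistent 10-20% difference much more meaningful than a single cold measurement.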
Re: [HACKERS] Binary timestamp with without timezone
On Thu, 16 Dec 2010 14:24:27 -0500, Tom Lane t...@sss.pgh.pa.us wrote: =?utf-8?q?Rados=C5=82aw_Smogura?= rsmog...@softperience.eu writes: Tom Lane t...@sss.pgh.pa.us Thursday 16 December 2010 18:59:56 =?utf-8?q?Rados=C5=82aw_Smogura?= rsmog...@softperience.eu writes: ... This timestamp must be properly encoded depending on whether the target is WITH TZ or not, but JDBC (and other clients, probably too) doesn't have any knowledge about the target type when the statement is executed Seems like you need to fix *that*. I don't say it's a bad way to send timestamps in text mode. It's a good solution, because timestamp without tz will silently ignore the tz offset, and timestamp with tz will use the offset in calculations if it is there; if not, it will use the server TZ. No, what I'm saying is that it's complete folly to be sending binary data for a value you don't know the exact type of. I know something about the value I want to send, but only that it should be a timestamp. I don't know if it should be with or without tz. There are too many ways for that to fail, and too few ways for the backend to validate what you sent. Adding more possible ways to interpret a binary blob makes that problem worse, not better. The official JDBC driver release uses this technique to send timezone timestamps, but for text mode; any timestamp is sent as UNSPECIFIED. So the text mode driver can fail in this way too. What you need to fix is the inadequate type bookkeeping in JDBC. If you don't know the exact type of the value you're going to send, send it in text mode, where you have some reasonable hope of a mismatch being detected. I know that this procedure isn't good, in text mode as well as in binary mode, but it gives some chance to do it better. In both cases we can find examples where this behaviour will fail, but in the proposed solution I added (I hope in a safe way) support for timezone information, which the binary protocol is missing compared to text mode, and which can be useful.
Maybe a better idea is to create a new timestamptz type that will fully support time offsets and, most importantly, will give much more client-friendly casting to timestamp and timestamptz. I mean it should be castable to timestamptz as well as to timestamp, but in the latter case on a per-field basis: ('2010-01-01 +1:00')::timestamp -> '2010-01-01'. It could be better, because the missing tz offset in the current implementation can cause problems with historical DST offsets (many posts found). The binary protocol will not have this disadvantage when reading, because Java supports historical DST, and timestamptz is UTC-based. Regards, Radek -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Binary timestamp with without timezone
Hi, I work on binary support for JDBC. I saw a disadvantage of TIMESTAMPS WITH / WITHOUT TZ. Currently (in text mode) the driver always sends the date-time string with an appended time offset, as UNSPECIFIED, so the backend can choose to use the offset or not. In binary mode I can only send an 8-byte timestamp without an appended timezone. This timestamp must be properly encoded depending on whether the target is WITH TZ or not, but JDBC (and other clients, probably too) doesn't have any knowledge about the target type when the statement is executed (so currently I send timestamps as text). I think about the following patch (keeping backward compatibility) on timestamp (tz). The idea is as follows: if we have two additional bytes, they are the TZ offset, and we use this to convert the received time to UTC. I wrote it in an e-mail editor (sorry :) no C IDE this time),

Datum
timestamptz_recv(PG_FUNCTION_ARGS)
{
	StringInfo	buf = (StringInfo) PG_GETARG_POINTER(0);
#ifdef NOT_USED
	Oid			typelem = PG_GETARG_OID(1);
#endif
	int32		typmod = PG_GETARG_INT32(2);
	TimestampTz timestamp;
	int			tz;
	struct pg_tm tt, *tm = &tt;
	fsec_t		fsec;
	char	   *tzn;
	int16		tzOffset;	//Zone offset with precision to minutes 12*60=720

#ifdef HAVE_INT64_TIMESTAMP
	timestamp = (TimestampTz) pq_getmsgint64(buf);
#else
	timestamp = (TimestampTz) pq_getmsgfloat8(buf);
#endif
+	if (buf->len == 10) {	//We assume the two last bytes are the timezone offset
+		pq_copymsgbytes(buf, (char *) &tzOffset, 2 /*sizeof(int16)*/);
+#ifdef HAVE_INT64_TIMESTAMP
+		timestamp -= ((int64) tzOffset) * 60 /* sec */ * USECS_PER_SEC;
+#else
+		timestamp -= (float8) (tzOffset * 60 /* sec */);	//Good casting...?
+#endif
+	}

	/* rangecheck: see if timestamptz_out would like it */
	if (TIMESTAMP_NOT_FINITE(timestamp))
		 /* ok */ ;
	else if (timestamp2tm(timestamp, &tz, tm, &fsec, &tzn, NULL) != 0)
		ereport(ERROR,
				(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
				 errmsg("timestamp out of range")));

	AdjustTimestampForTypmod(&timestamp, typmod);

	PG_RETURN_TIMESTAMPTZ(timestamp);
}

Will this be enough to allow sending TIMESTAMPS WITH(OUT) TZ as UNSPECIFIED? Something similar should go for the (ugly) time with timezone. //Ofc, excellent behaviour would be if I could always send values with TZ and //leave _proper_ casting to the backend. Kind regards, Radosław Smogura http://www.softperience.eu -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
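On the client side, the proposed 10-byte message would be the usual 8-byte integer timestamp (microseconds since 2000-01-01 00:00:00, assuming a server built with HAVE_INT64_TIMESTAMP) followed by a 2-byte zone offset in minutes. A hypothetical Java sketch of the encoder — names are illustrative, not from the driver:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class TimestampTzEncoder {
    // PostgreSQL binary timestamps count microseconds from 2000-01-01 00:00:00;
    // this constant is that epoch expressed in Unix milliseconds.
    private static final long PG_EPOCH_DIFF_MILLIS = 946684800000L;

    /** Builds the 10-byte variant proposed above: an 8-byte integer timestamp
     *  followed by a 2-byte zone offset in minutes, which the server would
     *  subtract (offset * 60 * USECS_PER_SEC) to get UTC. */
    static byte[] encode(long unixMillisLocal, short tzOffsetMinutes) throws IOException {
        long pgMicros = (unixMillisLocal - PG_EPOCH_DIFF_MILLIS) * 1000L;
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(raw);
        out.writeLong(pgMicros);        // network byte order, like the protocol
        out.writeShort(tzOffsetMinutes);
        return raw.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] msg = encode(946684800000L, (short) 60); // 2000-01-01 UTC, +01:00
        System.out.println(msg.length); // 10
    }
}
```

A plain 8-byte message (no trailing short) would keep today's behaviour, which is how the proposal stays backward compatible.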
Re: [HACKERS] Binary timestamp with without timezone
Tom Lane t...@sss.pgh.pa.us Thursday 16 December 2010 18:59:56 =?utf-8?q?Rados=C5=82aw_Smogura?= rsmog...@softperience.eu writes: I work on binary support for JDBC. I saw a disadvantage of TIMESTAMPS WITH / WITHOUT TZ. Currently (in text mode) the driver always sends the date-time string with an appended time offset, as UNSPECIFIED, so the backend can choose to use the offset or not. In binary mode I can only send an 8-byte timestamp without an appended timezone. This timestamp must be properly encoded depending on whether the target is WITH TZ or not, but JDBC (and other clients, probably too) doesn't have any knowledge about the target type when the statement is executed Seems like you need to fix *that*. I don't say it's a bad way to send timestamps in text mode. It's a good solution, because timestamp without tz will silently ignore the tz offset, and timestamp with tz will use the offset in calculations if it is there; if not, it will use the server TZ. I think about the following patch (keeping backward compatibility) on timestamp (tz). The idea is as follows: if we have two additional bytes, they are the TZ offset, and we use this to convert the received time to UTC. I wrote it in an e-mail editor (sorry :) no C IDE this time), This is not a terribly good idea, and even if it was, how will you use it from a client that doesn't know which data type is really in use? The binary protocol disallows sending a timezone offset, which text mode allows (lack of information). I would like to send this with the same behavior as text mode does: send the local time with two bytes of the client TZ, as OID unspecified. One more thing: instead of pq_copymsgbytes, better will be tzOffset = pq_getint(buf, sizeof(int16)); Kind regards, Radosław Smogura http://www.softperience.eu -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [JDBC] [HACKERS] Improved JDBC driver part 2
I've just started a small clean up - now it's there. On Wed, 1 Dec 2010 12:06:19 +0100, Valentine Gogichashvili val...@gmail.com wrote: Hi, I cannot get the file: wget http://www.rsmogura.net/pgsql/pgjdbc_exp_20101130_C.tar.gz --2010-12-01 12:05:28-- http://www.rsmogura.net/pgsql/pgjdbc_exp_20101130_C.tar.gz Resolving www.rsmogura.net... 64.120.14.83 Connecting to www.rsmogura.net|64.120.14.83|:80... connected. HTTP request sent, awaiting response... 404 Not Found 2010-12-01 12:05:29 ERROR 404: Not Found. On Tue, Nov 30, 2010 at 7:49 PM, Radosław Smogura m...@smogura.eu wrote: Hello, Maybe you are interested in what I've done with JDBC === Original driver (Text mode) === * Memory * 1. Memory usage improvements when using result set input streams (no unneeded memory copy) - needs a few touches for bigger performance. 2. Memory usage improvements for large data; it should be no problem to load a 1GB bytea[] with only 300MB of memory (threshold size still hardcoded). * JDBC 4 * 1. XMLs are now correctly transformed before being sent to the server - the previous version used normal text-file transformations, which is not enough. 2. In all modes (text/binary) XMLs are sent in binary mode, so the driver doesn't need to do a special transformation (does it require libxml?), unless character streams are used. 3. JDBC4 exception throwing. 4. XML objects are readable only once; you can't reuse them. An update from the result set (silently set to null on RS.updateRow() - it shouldn't be silent) returns null until refreshRow(), but you can write to them after load. 5. The target XML behavior is streaming behavior, to not repeat the problems with bytea. * JDBC 4.1 * 1. Just started. * Others * 1. A few additional test cases. A few utils for XML checking (string equals is too weak) - not good, but better. 2. Fixed a bug causing improper time(stamp) encoding for WITH TIME ZONE fields after changing the default time zone. === Binary mode === 1. Read for almost all data types, with arrays. 2. Write for a few. 3. Much more restrictive checking when casting from one type to another. 4. Exceptions when casting from one type to another, improper, type. 5. ResultSet.getString() for XML will still return XML - the spec prohibits this (X - base type conversion, x - possible conversion, no x - no base and possible = no conversion). 6. No getPrivileges for metadata - no binary output for ACL!!! 7. Many, many tests passed. 8. Data reading is faster for all reads (checked with a profiler, against the original driver). Driver is here http://www.rsmogura.net/pgsql/pgjdbc_exp_20101130_C.tar.gz - it is currently JDK 6 compatible (it will not stay so), and the compressed patch takes about 136kb gzipped. Kind regards have a nice day -- Radosław Smogura http://www.softperience.eu -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- -- Radosław Smogura http://www.softperience.eu -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [JDBC] [HACKERS] Improved JDBC driver part 2
On Wed, 1 Dec 2010 12:47:13 +0100, Magnus Hagander mag...@hagander.net wrote: On Tue, Nov 30, 2010 at 19:49, Radosław Smogura m...@smogura.eu wrote: Hello, Maybe you are interested in what I've done with JDBC snip Driver is here http://www.rsmogura.net/pgsql/pgjdbc_exp_20101130_C.tar.gz - it is currently JDK 6 compatible (it will not stay so), and the compressed patch takes about 136kb gzipped. Is there any particular reason why this work can't be maintained as a branch of the main driver? My understanding is your work is based off that one? Being able to work like that would make things a lot easier to review. Yes, it's based on it, with CVS subfolders in the sources. I don't see any problems with maintaining this as a branch. Ah, one only needs to read up on CVS branching. That said, such a process would also be a lot easier if the JDBC driver wasn't in cvs ;) Yes, SVN is much nicer. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- -- Radosław Smogura http://www.softperience.eu -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Improved JDBC driver part 2
Hello, Maybe you are interested in what I've done with JDBC === Original driver (Text mode) === * Memory * 1. Memory usage improvements when using result set input streams (no unneeded memory copy) - needs a few touches for bigger performance. 2. Memory usage improvements for large data; it should be no problem to load a 1GB bytea[] with only 300MB of memory (threshold size still hardcoded). * JDBC 4 * 1. XMLs are now correctly transformed before being sent to the server - the previous version used normal text-file transformations, which is not enough. 2. In all modes (text/binary) XMLs are sent in binary mode, so the driver doesn't need to do a special transformation (does it require libxml?), unless character streams are used. 3. JDBC4 exception throwing. 4. XML objects are readable only once; you can't reuse them. An update from the result set (silently set to null on RS.updateRow() - it shouldn't be silent) returns null until refreshRow(), but you can write to them after load. 5. The target XML behavior is streaming behavior, to not repeat the problems with bytea. * JDBC 4.1 * 1. Just started. * Others * 1. A few additional test cases. A few utils for XML checking (string equals is too weak) - not good, but better. 2. Fixed a bug causing improper time(stamp) encoding for WITH TIME ZONE fields after changing the default time zone. === Binary mode === 1. Read for almost all data types, with arrays. 2. Write for a few. 3. Much more restrictive checking when casting from one type to another. 4. Exceptions when casting from one type to another, improper, type. 5. ResultSet.getString() for XML will still return XML - the spec prohibits this (X - base type conversion, x - possible conversion, no x - no base and possible = no conversion). 6. No getPrivileges for metadata - no binary output for ACL!!! 7. Many, many tests passed. 8. Data reading is faster for all reads (checked with a profiler, against the original driver). Driver is here http://www.rsmogura.net/pgsql/pgjdbc_exp_20101130_C.tar.gz - it is currently JDK 6 compatible (it will not stay so), and the compressed patch takes about 136kb gzipped. Kind regards have a nice day -- Radosław Smogura http://www.softperience.eu -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
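The memory improvements listed above hinge on reading large values as streams instead of materializing them; with the standard JDBC API that means ResultSet.getBinaryStream. The chunked-drain access pattern can be sketched as follows — an in-memory stream stands in for the driver's stream so the example is self-contained:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamBytea {

    /** Drains a stream in fixed-size chunks: the access pattern that lets a
     *  1GB bytea pass through a few-hundred-MB heap. In real use the stream
     *  would come from ResultSet.getBinaryStream(column). */
    static long drain(InputStream in) throws IOException {
        byte[] chunk = new byte[8192];  // only this much is resident at once
        long total = 0;
        int n;
        while ((n = in.read(chunk)) != -1)
            total += n;
        return total;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(drain(new ByteArrayInputStream(new byte[100000]))); // 100000
    }
}
```

Whether the driver can honour this without first buffering the whole DataRow message is exactly the threshold-size question mentioned in point 2 of the Memory section.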
Re: [HACKERS] [JDBC] JDBC and Binary protocol error, for some statements
This is what I've done: 1. Send Bind. 2. Put the requested format types on a stack. 3. On BindComplete get requestedFormats from the stack. 4. When Execute is complete (or the portal is suspended), use requestedFormats to change the field formats received from Describe or previously cached. I assume the server can't change formats after Bind, even if a portal Describe was fired. Is it all good? I don't know much about server internals. Kind regards, Radek. On Fri, 26 Nov 2010 01:02:25 -0500, Tom Lane t...@sss.pgh.pa.us wrote: Maciek Sakrejda msakre...@truviso.com writes: 21:43:02.264 (26) FE= Describe(statement=S_1) You're still doing the statement-flavor Describe. As Tom pointed out, this won't tell you the result types because it doesn't know them. Actually, technically if you issue a statement-flavor Describe *after* a Bind, the server does have this information, but I'm not surprised that it doesn't send it correctly, since it seems pointless to send the statement variation after already doing a Bind. In principle you could open more than one Portal off a Statement at the same time, so it wouldn't necessarily be well-defined anyway. regards, tom lane -- -- Radosław Smogura http://www.softperience.eu -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
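The stack-based bookkeeping in the steps above can be sketched as a simplified model (not the driver's actual code): the formats requested at Bind are pushed, used to override whatever the statement-variant Describe reported (which is all zeroes), and popped when the Execute completes:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class FormatTracker {
    // Format codes as in the protocol: 0 = text, 1 = binary.
    private final Deque<int[]> requested = new ArrayDeque<>();

    /** Called when Bind is sent: remember which result formats we asked for. */
    void onBindSent(int[] resultFormats) {
        requested.push(resultFormats.clone());
    }

    /** Called on RowDescription: the statement-variant Describe reports zeroes
     *  for format codes, so override them with what we requested at Bind. */
    int[] effectiveFormats(int[] describedFormats) {
        int[] asked = requested.peek();
        return asked != null ? asked : describedFormats;
    }

    /** Called when the Execute completes (or the portal is suspended). */
    void onExecuteComplete() {
        requested.pop();
    }

    public static void main(String[] args) {
        FormatTracker t = new FormatTracker();
        t.onBindSent(new int[]{1});                    // we asked for binary
        int[] eff = t.effectiveFormats(new int[]{0});  // server described "text"
        System.out.println(eff[0]); // 1
        t.onExecuteComplete();
    }
}
```

The stack (rather than a single field) matters because several Bind/Execute pairs can be in flight before the matching responses arrive.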
Re: [HACKERS] [JDBC] JDBC and Binary protocol error, for some statements
I checked against other parameter bindings and it looks like the problem is connected with oid=0. In those cases: 1. Executing the statement with the parameter sent as varchar, int, long, with text and binary format, is OK. 2. Executing the statement with oid=0 always fails; I've sent the parameter in text mode (encoded '2'), and in binary mode as an encoded int4 2 - both failed. On Thu, 25 Nov 2010 01:56:02 -0800, Maciek Sakrejda msakre...@truviso.com wrote: Haven't really gotten much further, but an interesting note: the named / unnamed prepared statement and portal stuff seems to be a red herring. I can add a name to the portal, or move to an unnamed prepared statement, and I still see the same thing. Which is interesting, since that's not what Radosław saw (his original test only failed once named prepared statements kicked in)... --- Maciek Sakrejda | System Architect | Truviso 1065 E. Hillsdale Blvd., Suite 215 Foster City, CA 94404 (650) 242-3500 Main www.truviso.com -- -- Radosław Smogura http://www.softperience.eu -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [JDBC] JDBC and Binary protocol error, for some statements
Hi, Thank you for your response. I would only ask to be sure... So, to summarise, I shouldn't believe the server's RowDescription (in the context of format) in this situation, but should only look at what I asked for, right? If I asked for columns in binary format, I need to do binary reading regardless of what the server responded? If I asked for odd columns in text and even columns in binary, do I need to choose the proper format based only on my request? But as to the last part of the cited protocol specification: when I sent the message with the statement parameter's type int4, int8, varchar, the format field wasn't set to 0, but 1. Kind regards, Radosław Smogura On Thu, 25 Nov 2010 12:23:03 -0500, Tom Lane t...@sss.pgh.pa.us wrote: AFAICS this isn't a bug. What you're issuing Describe against is the prepared statement, not the portal. The result column formats are not specified by a prepared statement, so Describe just returns zeroes for them. Result column formats are specified by the Bind command, which creates a Portal. If you'd issued the Describe against the Portal, you should get back the correct format codes. Per the protocol specification: The Describe message (statement variant) specifies the name of an existing prepared statement (or an empty string for the unnamed prepared statement). The response is a ParameterDescription message describing the parameters needed by the statement, followed by a RowDescription message describing the rows that will be returned when the statement is eventually executed (or a NoData message if the statement will not return rows). ErrorResponse is issued if there is no such prepared statement. Note that since Bind has not yet been issued, the formats to be used for returned columns are not yet known to the backend; the format code fields in the RowDescription message will be zeroes in this case. Now, if there's something in the JDBC driver that expects DescribeStatement to return useful result format codes, that'd be a bug in the driver.
regards, tom lane -- -- Radosław Smogura http://www.softperience.eu -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [JDBC] JDBC and Binary protocol error, for some statements
On Thu, 25 Nov 2010 11:28:02 -0800, Maciek Sakrejda msakre...@truviso.com wrote: So, to summarise, I shouldn't believe server DescribeRow (in context of format), in this situation, but only I should look at this what I asked for, isn't it? If I asked for columns in binary format, I need to do binary reading regarding what server has responded? Yes, because in this case 0 doesn't mean the result will be in text, it means, you issued the statement-variant of Describe, so I'm not sure what the result format will be yet. If I asked for odd columns in text, even in binary do I need to choose proper format basing only on my request? I don't quite understand this question, but I think so. I don't think there's ever a situation where the server will ignore your result format requests. But to the last part of cited protocol specification, when I've sent message with statement parameter's type int4, int8, varchar the format field wasn't set to 0, but 1. I wasn't able to reproduce that with my standalone test case. When I changed the parameter oid to 23, I still got the same behavior. Can you alter my test case to reproduce the error? I will do it tomorrow. In this situation I need to test portals as well. --- Maciek Sakrejda | System Architect | Truviso 1065 E. Hillsdale Blvd., Suite 215 Foster City, CA 94404 (650) 242-3500 Main www.truviso.com -- -- Radosław Smogura http://www.softperience.eu -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [JDBC] JDBC and Binary protocol error, for some statements
Hm... I moved Bind before Describe, and I now have // Construct a new portal if needed. Portal portal = null; if (usePortal) { String portalName = "C_" + (nextUniqueID++); portal = new Portal(query, portalName); } sendBind(query, params, portal, noBinaryTransfer); if (describeStatement) { sendDescribeStatement(query, params, describeOnly); if (describeOnly) return; } still nothing 21:43:02.263 (26) simple execute, handler=org.postgresql.jdbc2.abstractjdbc2statement$statementresulthand...@337ec9f7, maxRows=0, fetchSize=0, flags=16 21:43:02.264 (26) FE= Parse(stmt=S_1,query=SELECT $1::int,oids={0}) 21:43:02.264 (26) FE= Bind(stmt=S_1,portal=null,$1='2') 21:43:02.264 (26) FE= Describe(statement=S_1) 21:43:02.264 (26) FE= Execute(portal=null,limit=0) 21:43:02.265 (26) FE= Sync 21:43:02.265 (26) =BE ParseComplete [S_1] 21:43:02.265 (26) =BE BindComplete [null] 21:43:02.266 (26) =BE ParameterDescription 21:43:02.266 (26) =BE RowDescription(1) 21:43:02.266 (26) Field(,INT4,4,T) 21:43:02.266 (26) =BE DataRow(len=4) 21:43:02.267 (26) =BE CommandStatus(SELECT 1) 21:43:02.267 (26) =BE ReadyForQuery(I) If I've understood correctly, I should get the proper result after the 1st Bind...? On Thu, 25 Nov 2010 12:21:39 -0800, Maciek Sakrejda msakre...@truviso.com wrote: OTOH, it seems possible that the JDBC driver might behave differently depending on whether parameter types were prespecified or not --- it might issue Describe earlier in order to get the parameter types, perhaps. Ah. Bingo: boolean describeStatement = describeOnly || (!oneShot && paramsHasUnknown && queryHasUnknown && !query.isStatementDescribed()); --- Maciek Sakrejda | System Architect | Truviso 1065 E. Hillsdale Blvd., Suite 215 Foster City, CA 94404 (650) 242-3500 Main www.truviso.com -- -- Radosław Smogura http://www.softperience.eu -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [JDBC] JDBC and Binary protocol error, for some statements
Thank you, but I thought about this last night. Opening unnecessary portals isn't a good idea, and neither is sending a second Describe when the statement has already been prepared; currently the JDBC driver doesn't do this. I think it would be better to store the formats we requested on a stack, and then coerce to those formats when the results are handled.

Kind regards,
Radosław Smogura

On Fri, 26 Nov 2010 01:02:25 -0500, Tom Lane t...@sss.pgh.pa.us wrote:
> Maciek Sakrejda msakre...@truviso.com writes:
>>> 21:43:02.264 (26) FE=> Describe(statement=S_1)
>>
>> You're still doing the statement-flavor Describe. As Tom pointed out,
>> this won't tell you the result types because it doesn't know them.
>
> Actually, technically if you issue a statement-flavor Describe *after*
> a Bind, the server does have this information, but I'm not surprised
> that it doesn't send it correctly, since it seems pointless to send the
> statement variation after already doing a Bind. In principle you could
> open more than one Portal off a Statement at the same time, so it
> wouldn't necessarily be well-defined anyway.
>
> regards, tom lane

--
Radosław Smogura
http://www.softperience.eu
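The "store what format we requested" idea could look roughly like this minimal sketch (names are illustrative, not pgjdbc's). Since responses arrive in the order the queries were sent, a FIFO queue rather than a strict stack models the in-flight requests:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hedged sketch: remember which result format (0 = text, 1 = binary) the
// driver *requested* for each in-flight query; when the matching
// RowDescription arrives, override the server-reported format with the
// remembered one. Responses come back in send order, hence a FIFO queue.
public class RequestedFormats {
    public static final short TEXT = 0, BINARY = 1;
    private final Deque<short[]> pending = new ArrayDeque<>();

    // called when Bind is sent
    public void pushRequested(short[] perColumnFormats) {
        pending.addLast(perColumnFormats.clone());
    }

    // called when the matching RowDescription/DataRow arrives
    public short[] popRequested() {
        return pending.removeFirst();
    }

    public static void main(String[] args) {
        RequestedFormats rf = new RequestedFormats();
        rf.pushRequested(new short[]{BINARY});
        // later, ignore the (possibly wrong) format in RowDescription:
        short actual = rf.popRequested()[0];
        System.out.println(actual == BINARY ? "binary" : "text");
    }
}
```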
Re: [HACKERS] [JDBC] JDBC and Binary protocol error, for some statements
I didn't describe the log correctly: the first attached response is the normal execution (flags QUERY_SUPPRESS_BEGIN | QUERY_ONESHOT); the second is the compiled statement (QUERY_SUPPRESS_BEGIN only). Text format is marked as 0, binary format as 1.

The first execution shown (flags=17) is good: it says the result is sent in binary format, as int4. But in the second one (flags=16, statement compiled) the result is bad, because the server described the row as text:

    07:52:06.061 (54) <=BE RowDescription(1)
    07:52:06.061 (54)         Field(,INT4,4,T)

yet the received tuple was clearly in binary format: 0x00, 0x00, 0x00, 0x02. (Look at the length: it should be 1 or 2 if it were text format with the value 2.) Put simply, the server said "you will receive text data" but sent a binary-encoded int. I've checked this against 8.4 and 9.0.1.

Maciek Sakrejda msakre...@truviso.com wrote on Wednesday 24 November 2010 18:02:27:
>> Result is oid=23, format=(0) T, value = 0x00,0x00,0x00,0x02
>
> What do you mean regarding the format? Are you just inferring that from
> the data? If memory serves, the format of a particular column is not
> specified anywhere other than the RowDescription, and according to your
> JDBC log output above, the server is telling you the format is text
> (which is your point -- it doesn't match the resulting data -- but I
> want to make sure we're clear on what's actually going on).

It's jdbc2.PreparedStatementTest, from the JDBC driver unit tests. I've put the sources up here: http://www.rsmogura.net/pgsql/pgjdbc_exp_20101124.tar.gz (the compiled driver and unit tests are in the parent directory). In the above, all tests not related to this bug have been commented out; just run "ant test".

> Also, can you narrow this down to a simple, self-contained test case
> (with code)? Even if it's against a custom driver build, that would be
> easier to investigate.
> ---
> Maciek Sakrejda | System Architect | Truviso
> 1065 E. Hillsdale Blvd., Suite 215
> Foster City, CA 94404
> (650) 242-3500 Main
> www.truviso.com

Kind regards,
Radek
[HACKERS] JDBC and Binary protocol error, for some statements
Hi,

I'm working on implementing the binary protocol. During tests, "SELECT ?::int" was executed a few times so that it compiled into a server prepared statement, but after this the backend returned field format = text while sending the field data in binary mode. The parameter is a Java long and is sent in text mode, with oid 0, as the value 2.

Result is oid=23, format=(0) T, value = 0x00,0x00,0x00,0x02

When I send the parameter explicitly in binary mode with the int4 oid, the problem doesn't occur.

Here is a debug trace of the protocol:

--- Normal execution
    07:52:02.995 (54) simple execute, handler=org.postgresql.jdbc2.abstractjdbc2statement$statementresulthand...@508aeb74, maxRows=0, fetchSize=0, flags=17
    07:52:02.996 (54) FE=> Parse(stmt=null,query=SELECT $1::int,oids={0})
    07:52:02.996 (54) FE=> Bind(stmt=null,portal=null,$1='2')
    07:52:02.996 (54) FE=> Describe(portal=null)
    07:52:02.996 (54) FE=> Execute(portal=null,limit=0)
    07:52:02.996 (54) FE=> Sync
    07:52:02.997 (54) <=BE ParseComplete [null]
    07:52:02.997 (54) <=BE BindComplete [null]
    07:52:02.998 (54) <=BE RowDescription(1)
    07:52:02.998 (54)         Field(,INT4,4,B)
    07:52:02.998 (54) <=BE DataRow(len=4)
    07:52:02.998 (54) <=BE CommandStatus(SELECT)
    07:52:02.999 (54) <=BE ReadyForQuery(I)

--- Compiled statement execution
    07:52:06.059 (54) simple execute, handler=org.postgresql.jdbc2.abstractjdbc2statement$statementresulthand...@128ae45a, maxRows=0, fetchSize=0, flags=16
    07:52:06.059 (54) FE=> Parse(stmt=S_1,query=SELECT $1::int,oids={0})
    07:52:06.059 (54) FE=> Describe(statement=S_1)
    07:52:06.060 (54) FE=> Bind(stmt=S_1,portal=null,$1='2')
    07:52:06.060 (54) FE=> Execute(portal=null,limit=0)
    07:52:06.060 (54) FE=> Sync
    07:52:06.060 (54) <=BE ParseComplete [S_1]
    07:52:06.061 (54) <=BE ParameterDescription
    07:52:06.061 (54) <=BE RowDescription(1)
    07:52:06.061 (54)         Field(,INT4,4,T)
    07:52:06.061 (54) <=BE BindComplete [null]
    07:52:06.061 (54) <=BE DataRow(len=4)
    07:52:06.062 (54) <=BE CommandStatus(SELECT)
    07:52:06.062 (54) <=BE ReadyForQuery(I)

org.postgresql.jdbc4.exceptions.PSQLDataException: Zła wartość dla typu int: (Polish: "Bad value for type int:")

Kind regards,
Radek
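The mismatch reported above comes down to the two wire encodings of the same value. This illustrative snippet (not driver code) shows why a text-mode parse of binary bytes fails with exactly a "bad value for type int" style error:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// The logical value 2 is 4 bytes {0,0,0,2} on the wire in binary format
// (network byte order), but the single byte "2" in text format. If
// RowDescription claims text while the DataRow is actually binary, a
// text parse of {0,0,0,2} throws -- matching the error in the log above.
public class Int4Wire {
    public static int decodeBinary(byte[] wire) {
        return ByteBuffer.wrap(wire).getInt(); // big-endian int4
    }

    public static int decodeText(byte[] wire) {
        return Integer.parseInt(new String(wire, StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        byte[] binary = {0, 0, 0, 2};
        System.out.println(decodeBinary(binary));               // 2
        System.out.println(decodeText(new byte[]{'2'}));        // 2
        try {
            decodeText(binary);  // wrong decoder for the actual format
        } catch (NumberFormatException e) {
            System.out.println("text parse of binary bytes fails");
        }
    }
}
```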
[HACKERS] Experimental JDBC driver
Hello,

I've recently added some features to the JDBC driver, making it in some ways an experimental driver. It could be quite useful, but if possible I would like to ask for some reviews and tests; maybe some of the ideas here will be handy. Because the changeset is huge, I'm not sending a patch set, only the URL where you can find the sources and a compiled binary: http://www.rsmogura.net/pgsql/ (the drivers are called pgjdbc_exp). Below is some information about the changes; the TODO and TO_THINK items are a set of loose ideas.

=== JDBC 4 Exceptions ===
1. The driver can throw JDBC 4 exceptions.
2. PSQLException is now an interface, to keep backward compatibility as far as possible; see TO_THINK.
3. TODO: Many static methods still use PgSqlExceptionFactory.getDefault(); this should be removed.
4. TODO: Add functionality to let the user choose between throwing PSQL*Exceptions and plain SQL*Exceptions.
5. TODO: Check that the decision block about the exception type is fully correct and no states are missing; eventually try to decide whether to throw Transient / NonTransient exceptions.
6. Currently we throw only NonTransient exceptions, as any error in the server only allows committing or rolling back the transaction; no further data read or update is possible.
7. TO_THINK: Changing PSQLException to an interface should make no difference once the client application is fully rebuilt, but a 3rd-party binary library may break (different VM opcodes for calling methods). Because we have an exception factory, we can revert PSQLException to a class, create PSQLExceptionInterface, and add a connection parameter useOldException, default true: if true, PSQLException is thrown on error; if false, an SQLException implementing PSQLExceptionInterface.

=== Binary Read and Write ===
Thanks to Mikko for some concepts from his patch (especially about integer/float dates).
1. A few basic data types have been implemented (with server types):
   - read: bool, double, int, short, decimal, varchar, text; date, timestamp (with / without time zone)
   - write: int, decimal, date, timestamp
   - casting from a server type to a Java type is currently supported for a few classes and types; this needs to be checked step by step against the JDBC spec
   - dates / times are currently limited to integer-dates servers
2. The reader / writer is configured per connection with factories, so if PostgreSQL changes the binary representation in the future, only the appropriate reader / writer implementation has to change.
3. TODO: Use a mechanism similar to the binary factory to read / write the character protocol.
4. TODO: Add binary modes to the PGType classes.
5. TODO: Copy all static converters from PGType to the factory, and mark the methods as deprecated (statics disallow using custom exception factories).
5a. Subtask: create a casting facility, extracting PGType's cast methods into it, and connect it with the current if..else trees in ResultSet.get().
6. TODO: Pluggability model for internally reading and writing other server types in binary / text mode.
7. Added support for the UUID Java class and server type.
8. TODO: The binary reader always returns byte[]; think about writing directly to a stream (less GC overhead).
9. Hacked the JDK's MutableInteger: less GC overhead and more performance when parsing DECIMAL.
10. TODO: Make additional tests of timestamps (dates, etc.) with / without time zone when the server is in a different time zone than the driver, and check what happens on a TZ change in the client. Subject to ask on pgsql-hackers.
11. TO_THINK: It's worth thinking about mapping OID to JDBC RowId; just sugar for sugar.
12. No Describe, or a lazy decision, for reading a given value as binary or character.
13. If you would like to run all JUnit tests in binary mode, call: ant -DbinaryTransfer=true

=== Other changes ===
1. setStatementTimeout uses a Java Timer. It's not fully safe, as the timer will abort the statement while it is executing; long Java-side processing or some sub-statements can break this solution.

=== JDBC 4.1 ===
1. build.xml changes, and a basic implementation (with UnsupportedOperationException) of some JDBC interfaces, to use JDK 7 / JDBC 4.1.

Kind regards,
Radek
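Point 2 of the "Binary Read and Write" list (per-connection factories) could be sketched like this. The interface shape and OID constants are illustrative assumptions, not the experimental driver's actual API, though 23 (int4) and 16 (bool) are the real PostgreSQL type OIDs:

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of a per-connection factory that maps a server type OID
// to a binary reader, so a future change in PostgreSQL's binary
// representation only requires swapping one reader implementation.
public class BinaryReaders {
    interface Reader { Object read(byte[] wire); }

    static final int BOOL_OID = 16, INT4_OID = 23;
    private final Map<Integer, Reader> byOid = new HashMap<>();

    public BinaryReaders() {
        byOid.put(INT4_OID, w -> ByteBuffer.wrap(w).getInt()); // 4-byte big-endian
        byOid.put(BOOL_OID, w -> w[0] != 0);                   // 1 byte, 0 or 1
    }

    public Object read(int oid, byte[] wire) {
        Reader r = byOid.get(oid);
        if (r == null)
            throw new IllegalArgumentException("no binary reader for oid " + oid);
        return r.read(wire);
    }

    public static void main(String[] args) {
        BinaryReaders rs = new BinaryReaders();
        System.out.println(rs.read(INT4_OID, new byte[]{0, 0, 0, 7}));
        System.out.println(rs.read(BOOL_OID, new byte[]{1}));
    }
}
```

A fallback reader for unknown OIDs could return the raw byte[] instead of throwing, which matches TODO item 8 above.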
Re: [HACKERS] [JDBC] Support for JDBC setQueryTimeout, et al.
On Tue, 12 Oct 2010 20:03:29 +0200, Magnus Hagander mag...@hagander.net wrote:
> On Tue, Oct 12, 2010 at 17:55, David Fetter da...@fetter.org wrote:
>> On Tue, Oct 12, 2010 at 10:37:00AM -0500, Kevin Grittner wrote:
>>> David Fetter da...@fetter.org wrote:
>>>> Is there something incomplete about the ones I sent, and if so, what?
>>>
>>> Well, I'm still curious why it was necessary to modify the server
>>> side to implement an interface feature for which everything needed
>>> seems to be present on the client side.
>>
>> Not everything is. Let's imagine you have a connection pooler with two
>> clients, A and B. A calls setQueryTimeout, then starts a query, which
>> terminates in time, but dies before handling it. B connects to the
>> pool, gets A's connection, and finds a statement_timeout that's not
>> the default, even though only A's single query was supposed to have
>> that statement_timeout. This is not a situation that can be resolved
>> without being able to set a timer *on the server side*.
>
> Sure it can. The connection pooler just needs to issue a RESET ALL
> statement when it hands over a connection from one client to another.
> Isn't that what for example pgbouncer does - at least when configured
> per instructions?
>
> Also, doesn't this affect *all* settings, not just timeout? Imagine
> client A executing a SET datestyle, for example. AFAICS, any connection
> pooler that *doesn't* issue a reset between handing this around is
> broken, isn't it?

If I understand you correctly, you are saying that RESET ALL should be invoked when taking a connection from the pool. But... the connection pooler will not send RESET ALL in some situations, because the JDBC driver may get no notification about switching clients. In EE platforms (e.g. Glassfish), the server sometimes creates its own pool, and in certain configurations, when you close a Connection, the server will never pass the close to the PooledConnection, nor to the physical connection. This is due to the fact that GF and others add statement-pooling functionality (if the server called close on the pooled connection, then reusing a cached statement would cause an error; in fact this problem occurs with Hibernate, C3P0, XA, and GFv3).

To summarize, you should never assume that RESET ALL, or any other behavior, will be invoked when switching clients. Am I right?

--
Radosław Smogura
http://www.softperience.eu
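The pooling hazard argued over here can be modeled without a real server. The following toy model (assumed names, no real connections) shows how a session setting like statement_timeout leaks from client A to client B when the pooler never performs a RESET ALL on handover:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model: session settings survive on the physical connection. A
// pooler that issues RESET ALL on handover clears them; a
// statement-caching pooler that never sees the handover leaks client
// A's settings to client B.
public class PoolHandover {
    private final Map<String, String> sessionSettings = new HashMap<>();
    private final boolean resetOnHandover;

    public PoolHandover(boolean resetOnHandover) {
        this.resetOnHandover = resetOnHandover;
    }

    public void set(String name, String value) { sessionSettings.put(name, value); }

    public String get(String name) { return sessionSettings.getOrDefault(name, "default"); }

    // simulates handing the physical connection to the next client
    public void handover() {
        if (resetOnHandover)
            sessionSettings.clear(); // stands in for issuing RESET ALL
    }

    public static void main(String[] args) {
        PoolHandover leaky = new PoolHandover(false);
        leaky.set("statement_timeout", "5000");
        leaky.handover();
        // client B inherits A's timeout: prints 5000
        System.out.println(leaky.get("statement_timeout"));
    }
}
```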
Re: [HACKERS] [JDBC] Support for JDBC setQueryTimeout, et al.
On Fri, 15 Oct 2010 10:37:05 +0200, Magnus Hagander mag...@hagander.net wrote:
>> But... the connection pooler will not send RESET ALL in some
>> situations, because the JDBC driver may get no notification about
>> switching clients. In EE platforms (e.g. Glassfish), the server
>> sometimes creates its own pool, and in certain configurations, when
>> you close a Connection, the server will never pass the close to the
>> PooledConnection, nor to the physical connection. This is due to the
>> fact that GF and others add statement-pooling functionality (if the
>> server called close on the pooled connection, then reusing a cached
>> statement would cause an error; in fact this problem occurs with
>> Hibernate, C3P0, XA, and GFv3).
>
> To me, that sounds like a bug in the connection pooler. It is only safe
> under quite limited circumstances.

It's hard to call this a bug. The GF connection pooler is a general pooler for all drivers, so it can't know anything about resetting, and, if I'm right, JDBC 4 doesn't support such notifications. It can't close logical connections either: if it did, the cached statements would be invalidated too. And the benefits of pooling statements are much greater than those of RESET ALL, because you can take advantage of precompiled prepared statements, increasing performance; it is comparable to using a connection pool instead of opening physical connections each time. Around 2008, Sun published results (spectest?) for Glassfish v2 and PostgreSQL with special patches allowing statement caching (on the JDBC driver side); it was the fastest combination, beating all professional and highly paid software. JBoss, WAS, and others can probably do the same wrapping.

>> in fact this problem occurs with Hibernate, C3P0, XA, and GFv3).

Hibernate with C3P0 hacks and uses some private code, unwraps objects, etc... :(, so when the private implementation changes, things blow up.

> --
>  Magnus Hagander
>  Me: http://www.hagander.net/
>  Work: http://www.redpill-linpro.com/

--
Radosław Smogura
http://www.softperience.eu
Re: [HACKERS] [JDBC] Support for JDBC setQueryTimeout, et al.
On Wed, 13 Oct 2010 21:01:06 -0400, Robert Haas robertmh...@gmail.com wrote:
> Is this a JDBC patch or a PG patch? Are we tracking JDBC patches using
> the CF app?

It is a JDBC patch. I will clean it up and submit it on that site. I didn't know about such an application or such a process.

--
Radosław Smogura
http://www.softperience.eu
Re: [HACKERS] [JDBC] Support for JDBC setQueryTimeout, et al.
> Regarding JDBC in the CF process -- other interfaces are handled there.
> I haven't seen one patch this size for JDBC since I've been involved,
> let alone two competing patches to implement the same feature. Small
> patches which can be quickly handled don't make sense to put into the
> process, but it seemed reasonable for these.

In any case I'm sending this patch, and I will put it under Miscellaneous in the CF. The cleaned patch is only 47k (the uncleaned one contained some binary-read classes), and about 50% of it is a big test case.

Have a nice day,
Radek

statemnt_to_20101014.patch.gz
Description: GNU Zip compressed data
Re: [HACKERS] [JDBC] Support for JDBC setQueryTimeout, et al.
What I see in your patch is that it sends an additional statement to the server. This adds some unnecessary latency (especially over TCP/IP).

>> I sent such a patch fully in Java
>> (http://archives.postgresql.org/pgsql-jdbc/2009-11/msg00010.php),
>> implementing cancellation with a Timer and the cancel-query facility
>> of the PostgreSQL server,
>
> Would you like to update it?

I updated the patch to the latest CVS version; I didn't have time to remove some leftovers from it. If something is wrong with the patch, give me feedback.

Kind regards,
Radosław Smogura
http://softperience.pl

statemnt_to_20101013.patch.gz
Description: GNU Zip compressed data
Re: [HACKERS] [JDBC] Support for JDBC setQueryTimeout, et al.
On Mon, 11 Oct 2010 08:29:16 -0500, Kevin Grittner kevin.gritt...@wicourts.gov wrote:
> David Fetter da...@fetter.org wrote:
>> Please find enclosed a WIP patch from one of my co-workers intended
>> to support JDBC's setQueryTimeout, along with the patch for JDBC that
>> uses it.
>
> I agree that it would be very nice to support this JDBC feature, but
> I'm not clear on why this can't be done with just JDBC changes using
> the java.util.Timer class and the existing Statement.cancel() method.
> Can you explain why the backend needed to be touched?
>
> -Kevin

I sent such a patch, fully in Java (http://archives.postgresql.org/pgsql-jdbc/2009-11/msg00010.php), implementing cancellation with a Timer and the cancel-query facility of the PostgreSQL server. Unfortunately no one has reviewed it, even though the setQueryTimeout TODO has been on the dashboard for a long time, and it is important for enterprise-class software.

--
Radosław Smogura
http://www.softperience.eu
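The client-side Timer approach Kevin describes can be sketched without a database. In this hedged, self-contained model (not the referenced patch's code), the "query" is a sleep and the cancellation is a flag where the real driver would call Statement.cancel():

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicBoolean;

// Client-side-only sketch of setQueryTimeout: arm a java.util.Timer
// before executing; if the timer fires first, cancel the running
// statement (stood in by a flag here), otherwise disarm the timer.
public class QueryTimeout {
    public static boolean runWithTimeout(long timeoutMs, long queryMs) {
        Timer timer = new Timer(true); // daemon watchdog thread
        final AtomicBoolean cancelled = new AtomicBoolean(false);
        timer.schedule(new TimerTask() {
            @Override public void run() {
                cancelled.set(true);   // real driver would call Statement.cancel()
            }
        }, timeoutMs);
        try {
            Thread.sleep(queryMs);     // stands in for query execution
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
        timer.cancel();                // disarm if the query finished first
        return cancelled.get();        // true if the timeout won the race
    }

    public static void main(String[] args) {
        System.out.println(runWithTimeout(50, 300));  // times out -> true
        System.out.println(runWithTimeout(1000, 20)); // finishes -> false
    }
}
```

Note the race discussed in the thread is visible even in this model: if the timer fires just as the query finishes, the cancel can land on the next query, which is exactly the leakage argument for a server-side statement_timeout.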