Re: [HACKERS] additional json functionality
Hi everyone

I used to work on a project storing large quantities of schema-less data, initially using MongoDB, then Postgres with JSON, and eventually I implemented BSON support for Postgres to get the best of both worlds: https://github.com/maciekgajewski/postgresbson

I don't think that JSONB is a good idea. There is a lot to learn from MongoDB's mistakes in this area.

1. As noted in this thread previously, JSON is a serialization format, not a document format.

2. Almost any structured data type, self-describing or not, can be serialized to/from JSON, but each application uses only a subset of it, and interprets it in its own specific way.

3. JSON's greatest strength is interoperability. It is a great feature of Postgres that JSON is stored as text; it's basically 'text, but you can do something with it'. There are many JSON implementations out there, and one should make no assumptions about an application's expectations. For instance: the JSON standard (RFC 4627) defines all numbers to be doubles, yet I've seen applications storing 64-bit integers (which wouldn't fit in double precision), or even arbitrary-precision integers. Most parsers are OK with that.

4. JSON's greatest weakness is performance. Because of 1., it needs to be parsed before any useful information is extracted.

5. 3. and 4. are mutually exclusive; this is one of the most valuable takeaways I have from working with Mongo, and with BSON in particular. BSON is an attempt to create a 'binary JSON', and a failed one. It is a poor serialization format: faster than JSON, but less flexible. Being binary, it is strongly typed, and it uses various gimmicks to preserve flexibility: implicit type casts, 3 different equality comparison functions, etc. And it's not fully convertible to/from JSON; no binary format is. It is a poor document format as well: it retains some of JSON's performance problems, namely its serial nature and inefficient storage.

6. Speed matters to some, and being able to generate binary data in the application and send it to the database without any serialization/parsing in between provides a great optimization opportunity. One thing the Mongo guys got right is the fixed, well-defined binary representation: an application can use a provided library to generate objects, and doesn't need to worry about the server's version or endianness. In the application I've mentioned before, switching from JSON to BSON (in Postgres 9.2, using postgresbson) increased throughput by an order of magnitude. It was an insert-heavy database with indexes on object fields. Both serializing in the application and deserializing in the server were ~10x faster.

7. It seems to me that JSONB is going to repeat all the mistakes of BSON; it's going to be 'neither'. If there is an agreement that Postgres needs a 'document' format, why not acknowledge 5. and simply adopt one of the existing formats? Or even better: adopt none, provide many, provide binary send/recv and conversion to and from JSON, and let the user choose. The world is full of self-describing binary formats: BSON, MessagePack (http://msgpack.org/), protobuf, and hierarchical H-Store is coming along. Adding another one would create confusion, and a situation similar to this: http://xkcd.com/927/

And a side note: Postgres' greatest and most under-advertised feature is its extensibility. People tend to notice only the features present in the core package, while there should be a huge banner on top of http://www.postgresql.org/: Kids, we support all data types: we have XML, we have JSON, we have H-store, we have BSON, and all of it with built-in indexing, storage compression and full transaction support!

Maciej G.
Re: [HACKERS] Review: query result history in psql
The query history is stored within the client, so once the user stops the client, it is gone. But yes, it would be useful to have some tool that would let you see what's in there. It could be a command (\showans ?) that would list all :ansXXX variables, together with the query text and the size of the answer. It would probably look ugly for very long queries, but could be useful anyway.

I'm not sure if I'll be able to implement this feature any time soon, as I'm very busy at the moment and going on a business trip in a few days. In the meantime, I've applied your suggestions and moved the string-manipulating functions to stringutils. Also fixed a tiny bug. Patch attached.

M

psql-ans.3.diff
Description: Binary data

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Review: query result history in psql
>> I'm not really bought into some of the ideas. but maybe some interactive
>> mode should be useful - so after execution, and showing the result, the
>> user would be prompted whether the result should be saved or not.
>
> I like the idea, in addition to the ordinary mode. Personally, I would
> use the ordinary mode, but I can see how 'interactive' would be useful.

This would require a complex change to the client code. And the result would eventually become annoying: an interactive question after each and every query. Currently, when turned on, every result is stored and a simple notification is printed.

>> yes, the names :ans01, :ans02, ... miss semantics - How can I join this
>> name (and content) with some SQL query?
>
> That makes sense. I think having part of / the whole query string would
> be very helpful.

Great suggestion! The naming is obscure and non-informative, I agree. If you have a nice idea how to make it better, I'd love to discuss it. But please remember that it has one huge advantage: simplicity. The client is a classical command-line tool, and as such it delegates some of the functionality to external programs, like the pager or the terminal. I'm pretty sure that your terminal emulator has a 'find' function that would let you quickly locate the variable and the associated query in the scrollback.

M
Re: [HACKERS] Review: query result history in psql
> When I tested this feature, I had 30 caches per 5 minutes, and only a few
> of these queries were meaningful. Switching between off and on is not
> user friendly. I believe there can be other solutions than mine, but a
> possibility to conveniently clean unwanted caches is necessary.

If you know beforehand that you'll need the result of a query, you can always use SELECT ... INTO ... ; no client-side features required. This feature is intended for people running plenty of ad-hoc queries, where every result could potentially be useful.
Re: [HACKERS] Review: query result history in psql
Thanks for checking the patch!

So what's left to fix?
* moving the escaping-related functions to a separate module,
* applying your corrections.

Did I miss anything? I'll submit a corrected patch after the weekend.

M
Re: [HACKERS] Review: query result history in psql
Thank you for the review!

> There were a few english/grammatical mistakes that I went ahead and fixed.

Thank you for that. Could you send me a patch-to-a-patch, so I can include all the fixes in the next release?

> Additionally, I think some of the string manipulation might be placed
> outside of the main ans.c file. I don't know if there's a better place
> for 'EscapeForCopy' and 'GetEscapedLen'. Not really a big deal, just an
> organizational idea. I also changed 'EscapeForCopy' to 'EscapeAndCopy'.
> I think that better describes the functionality. 'EscapeForCopy' kind of
> implies that another function is needed to copy the string.

'EscapeForCopy' was meant to mean 'escape the string in the format required by the COPY TEXT format', so 'copy' in the name refers to the escaping format, not the action performed by the function. They could indeed be placed in a separate module; I'll do it.

> What does 'ans' stand for? I am not sure how it relates to the concept of
> a query history. It didn't stop my understanding of the code, but I don't
> know if a user will immediately know the meaning.

Some mathematical toolkits, like Matlab or Mathematica, automatically set a variable called 'ans' (short for 'answer') containing the result of the last operation. I was trying to emulate exactly this behaviour.

> Probably the biggest problem is that the query history list is missing a
> maximum size variable. I think this could be valuable for preventing
> users from shooting themselves in the foot. If the user is running large
> queries, they might accidentally store too much data. This is probably
> somewhat of an edge case, but I believe it is worth considering. We could
> provide a sensible default limit (10 queries?) and also allow the user to
> change it.

I was considering such a behaviour, but since the feature is turned off by default, I decided that whoever is using it is aware of the cost. Instead of truncating the history automatically (which could lead to a nasty surprise), I decided to equip the user with \ansclean, a command erasing the history. I believe it is better to let the user decide when the history should be erased than to do it automatically.

> Finally, is it worth resetting the query history every time a user
> reconnects to the database? I can see how this might interrupt a user's
> workflow. If the user suddenly disconnects (network connection
> interrupted, etc) then they would lose their history. I think this is
> definitely up for debate. It would add more management overhead (psql
> options etc) and might just be unnecessary. However, with a sane limit to
> the size of the query history, I don't know if there would be too many
> drawbacks from a storage perspective.

The history is not erased; it is always stored in the client's memory. When a history item is used for the first time, a TEMPORARY table is created in the database to store the data server-side. When the user disconnects from the database, the session ends and all these tables are dropped. The tables' names then have to be removed from the history, so that the next time an item is used, its table is created and populated again. I use the feature while switching often between databases, and it works seamlessly. Actually, it's quite useful for moving bits of data across databases: connect to database A, run a query, connect to database B, run another query joining local data with the results of the previous one.

> Those issues aside - I think it's a great feature! I can add the
> grammatical fixes I made whenever the final patch is ready. Or earlier,
> whatever works for you. Also, this is my first time reviewing a patch, so
> please let me know if I can improve on anything. Thanks!

This is my first submitted patch, so I can't really comment on the process. But if you could add the author's email to CC, the message would be much easier to spot.
I replied only after two days because I missed the message in the flood of other pgsql-hackers messages. I think I need to scan the list more carefully...

Maciej
Re: [HACKERS] [9.4 CF 1] The Commitfest Slacker List
Maybe this policy should be mentioned on the wiki, so that newbies like myself (who wouldn't even dare review patches submitted by seasoned hackers) are not surprised to see their own name on a wall of shame?

M
Re: [HACKERS] C++ compiler
It would be great. I'm working at the moment on porting the integer operations to unsigned types, and the code is essentially a small number of functions repeated for every combination of integer types. In C++ it could simply be one set of template functions. Less code, fewer bugs.
Re: [HACKERS] Unsigned integer types
I will implement it as an extension then.

My feeling is that PostgreSQL extensions tend to fall into obscurity. As an ordinary user it took me a really long time to find out that interesting features are available in the form of extensions; they are certainly under-marketed. But this is a topic for a separate discussion.

> You have not at all addressed the real problem with doing what you are
> asking for, the one that Tom Lane stated:
>
>> Basically, there is zero chance this will happen unless you can find a
>> way of fitting them into the numeric promotion hierarchy that doesn't
>> break a lot of existing applications. We have looked at this more than
>> once, if memory serves, and failed to come up with a workable design
>> that didn't seem to violate the POLA.

I'm sorry, I thought my proposal was clear. I propose not to integrate the unsigned types into the existing promotion hierarchy, and to behave just like gcc would with -Werror: require an explicit cast. Among themselves, the unsigned types would be automatically promoted upwards (uint2 -> uint4 -> uint8).

Maciek
Re: [HACKERS] Unsigned integer types
The reasons are: performance, storage and frustration.

I think the frustration comes from the fact that unsigned integers are universally available, except in PostgreSQL. I work with a really complex system with many moving parts, and Postgres really is one of the components that causes the least trouble (compared to other open-source and closed-source systems, which I shall leave unnamed), except for the unsigned integers. Let me give you a few examples:

1. SMALLINT

Probably the most popular unsigned short int on the planet: the IP port number. I had to store some network traffic data in the DB; I instinctively started prototyping it like this:

CREATE TABLE packets (addr INET, port SMALLINT, ... );

Of course it failed quickly and I had to bump the size to INTEGER. No real harm here, as the 2 bytes will probably go into some padding anyway, but somehow it feels wrong.

2. INTEGER

I had to store a record with several uint32 fields. I had to store an awful lot of them: hundreds of GB of data per day, with roughly half of each record consisting of uint32 fields. Increasing the data type to bigint would mean that I could store 3 instead of 4 days' worth of data on the available storage. Continuing with int4 meant that I would have to deal with the data in a special way when it enters and leaves the DB. It's easy in C: just cast uint32_t to int32_t. But the python code requires more complex changes. And the web backend too... It's suffering either way! Just imagine the conversation I had to have with my boss: either we increase the budget for storage, or we need to touch every bit of the system.

3. BIGINT

There is no escape from bigint. Numeric (or TEXT!) is the only thing that can hold a uint64, but when you have 10^9 and more records and you need to do some arithmetic on them, numeric is just too slow.

We use uint64 all across our system as a unique event identifier. It works fine, it's fast, and it's very convenient: passing a uint64 around, storing it, looking it up. We use it everywhere, including the UI and log files. So once I decided to use BIGINT to store it, I had to guard all the inputs and outputs and make sure it is handled correctly. Or so I thought. It turned out that some guys from a different department were parsing some logs with a perl parser and storing the results in the DB. They chose to store the uint64 id as TEXT: they probably tried BIGINT, failed, and decided - since they have low volume and are not doing any arithmetic - to store it as TEXT. And then someone came up with the idea of joining one table with the other: bigint with text. I did it. Initially I wrote a function that converted the text to numeric, then rotated it around 2^64 if necessary. It was too slow - too slow for something that should be a simple reinterpretation of data. Eventually I ended up writing a C function that first scanf("%llu")'d the text into a uint64_t, and then PG_RETURN_INT64-ed the value. It works fast, but operations hates me for increasing the complexity of the DB deployment.

---

I know some cynical people that love these kinds of problems; they feel that the constant struggle is what keeps them employed :) But I'm ready to use my private time to solve it once and for all.

I'm afraid that implementing uints as an extension would introduce some performance penalty (I may be wrong). I'm also afraid that with an extension I'd be left on my own, maintaining it forever, while if this could go into the core product, it would live forever.

As for the POLA violation: programmers experienced with statically typed languages shouldn't have problems dealing with the issues surrounding signed/unsigned integers (like the ones described here: http://c-faq.com/expr/preservingrules.html). Others don't need to use them.

Maciek

On 27 May 2013 16:16, Tom Lane t...@sss.pgh.pa.us wrote:
> Maciej Gajewski maciej.gajews...@gmail.com writes:
>> The lack of unsigned integer types is one of the biggest sources of
>> grief in my daily work with pgsql. Before I go and start hacking, I'd
>> like to discuss few points:
>> 1. Is there a strong objection against merging this kind of patch?
>
> Basically, there is zero chance this will happen unless you can find a
> way of fitting them into the numeric promotion hierarchy that doesn't
> break a lot of existing applications. We have looked at this more than
> once, if memory serves, and failed to come up with a workable design
> that didn't seem to violate the POLA.
>
>> 2. How/if should the behaviour of numeric literals change? The
>> minimalistic solution is: it shouldn't, literals should be assumed
>> signed by default. More complex solution could involve using C-style
>> suffix ('123456u').
>
> Well, if you don't do that, there is no need for you to merge anything:
> you can build unsigned types as an external extension if they aren't
> affecting the core parser's behavior. As long as it's external, you
> don't need to satisfy anybody else's idea of what reasonable behavior
> is ...
>
> regards, tom lane
[HACKERS] Unsigned integer types
Hi all

I know this topic was discussed before, but there doesn't seem to be any conclusion. The lack of unsigned integer types is one of the biggest sources of grief in my daily work with pgsql. Before I go and start hacking, I'd like to discuss a few points:

1. Is there a strong objection against merging this kind of patch? I can provide numerous reasons why using a bigger int or a numeric type just doesn't cut it.

2. How/if should the behaviour of numeric literals change? The minimalistic solution is: it shouldn't; literals should be assumed signed by default. A more complex solution could involve using a C-style suffix ('123456u').

3. How/if should comparing signed and unsigned types work? IMO it shouldn't be allowed, and an explicit cast should be required.

Thanks in advance!
Maciek
Re: [HACKERS] Patch proposal: query result history in psql
Polished version of the patch.

* The feature is disabled by default, and enabled by the backslash command \ans. Additionally, \ansclean cleans the result history.
* Escaping is applied when building the COPY IN string.

This patch is a diff between master:230e92c and https://github.com/maciekgajewski/psql-ans.git:2997f9c

Maciek

On 16 May 2013 19:18, David E. Wheeler da...@justatheory.com wrote:
> On May 16, 2013, at 7:02 AM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:
>>> I find this feature quite useful, but I understand that my use case
>>> may be quite unique.
>>
>> Just to say that I too find what you've done quite useful. Please add
>> your patch to the next commit fest for consideration in 9.4!
>
> FYI, you can add it here:
> https://commitfest.postgresql.org/action/commitfest_view?id=18
>
> Best,
> David

psql-ans.2.diff
Description: Binary data
[HACKERS] Patch proposal: query result history in psql
Attached patch contains a feature I've implemented for myself, to make working with huge datasets easier.

I work with large datasets (1E8 - 1E9 records), and the nature of my work is such that I must dig something out of the data on an ad-hoc basis. I spend a lot of time with psql. Sometimes a query runs for a few minutes, and when the result finally arrives, sometimes it's too big, contains too many columns or is ordered incorrectly. Quite often I was thinking to myself: if only I could run a query on the result, instead of having to re-run the original query and wait a few more minutes...

Eventually I just wrote the feature. I use it every day now and I'm really happy with it.

*How it works*

After a query result arrives, a message is displayed: 'Query result stored as :ansN', where N is an ordinal number. The user can then use the pseudo-variable :ansN as a relation name in subsequent queries. Under the hood, all the query results are stored locally, and when an :ansN variable is used, a temporary table is created and populated with the data. The variable then expands to the table's name.
Sample session:

===
anstest=# select * from quotes limit 10;
    date    | open  | high  |  low  | close |  volume  | adjclose
------------+-------+-------+-------+-------+----------+----------
 2013-05-03 | 22.57 | 22.85 | 22.55 | 22.57 | 45523300 |    22.57
 2013-05-02 | 22.25 | 22.32 | 22.15 | 22.32 | 27651500 |    22.32
 2013-05-01 | 22.10 | 22.35 | 22.10 | 22.15 | 39201600 |    22.15
 2013-04-30 | 22.29 | 22.38 | 22.21 | 22.29 | 34054800 |    22.29
 2013-04-29 | 22.31 | 22.32 | 22.00 | 22.27 | 36531800 |    22.27
 2013-04-26 | 21.98 | 22.40 | 21.97 | 22.21 | 47012500 |    22.21
 2013-04-25 | 22.21 | 22.23 | 21.91 | 21.95 | 41462900 |    21.95
 2013-04-24 | 21.69 | 22.03 | 21.65 | 21.96 | 51496600 |    21.96
 2013-04-23 | 21.55 | 21.69 | 21.36 | 21.50 | 65489600 |    21.50
 2013-04-22 | 21.67 | 21.68 | 21.11 | 21.35 | 87787900 |    21.35
(10 rows)

Query result stored as :ans0

anstest=# select date,close from :ans0 where date < '2013-05-01';
    date    | close
------------+-------
 2013-04-30 | 22.29
 2013-04-29 | 22.27
 2013-04-26 | 22.21
 2013-04-25 | 21.95
 2013-04-24 | 21.96
 2013-04-23 | 21.50
 2013-04-22 | 21.35
(7 rows)

Query result stored as :ans1

anstest=# select * from :ans1 order by date;
    date    | close
------------+-------
 2013-04-22 | 21.35
 2013-04-23 | 21.50
 2013-04-24 | 21.96
 2013-04-25 | 21.95
 2013-04-26 | 22.21
 2013-04-29 | 22.27
 2013-04-30 | 22.29
(7 rows)

Query result stored as :ans2
===

I find this feature quite useful, but I understand that my use case may be quite unique. If the maintainers think that this is something that could be useful for the general public, I'm ready to polish any rough edges of the attached patch to make it suitable for inclusion. Because the feature introduces some overhead, it should probably be turned off by default and turned on by a backslash command and/or a command-line parameter.

Maciek

psql-ans.1.diff
Description: Binary data