Re: [HACKERS] Tips/advice for implementing integrated RESTful HTTP API
Hmm, yes, I am learning that the bgworker system isn't as helpful as I had hoped due to the single-database restriction. As for writing a frontend, this might be the best solution. A Java frontend would be easy but pointless, because the whole point here is to provide a lightweight access method to the database for environments that can't use the JDBC or libpq libraries; deploying a Java setup would be too much trouble. I do see now that PostgreSQL uses one backend per connection rather than a worker pool as I had thought before, so there's nothing already in there to help me dispatch requests, and making my own worker pool that distributes requests using sockets wouldn't be any better than connecting back using libpq. A C frontend using libevent would be easy enough to make and deploy for this, I guess. But... maybe nobody really wants this thing anyway; there seem to be some other options out there already. Thanks for the feedback.

On Aug 31, 2014 8:46 PM, Craig Ringer <cr...@2ndquadrant.com> wrote:
> On 08/31/2014 12:40 PM, Dobes Vandermeer wrote:
>> 1. Connecting to multiple databases
>>
>> The background workers can apparently only connect to a single database
>> at a time, but I want to expose all the databases via the API.
>
> bgworkers are assigned a database at launch time (if SPI is enabled), and
> this database may not change during the worker's lifetime, same as a
> normal backend. Sometimes frustrating, but that's how it is.
>
>> I think I could use libpq to connect to PostgreSQL on localhost but this
>> might have weird side-effects in terms of authentication, pid use, stuff
>> like that.
>
> If you're going to do that, why use a bgworker at all? In general, what
> do you gain from trying to do this within the database server itself,
> rather than as an app in front of the DB?
>
>> I could probably manage a pool of dynamic workers (as of 9.4), one per
>> user/database combination or something along those lines. Even one per
>> request?
>> Is there some kind of IPC system in place to help shuttle the requests
>> and responses between dynamic workers? Or do I need to come up with my
>> own?
>
> The dynamic shmem code apparently has some queuing functionality. I
> haven't used it yet.
>
>> It seems like PostgreSQL itself has a way to shuttle requests out to
>> workers; is it possible to tap into that system instead? Basically some
>> way to send the requests to a PostgreSQL backend from the background
>> worker?
>
> It does? It's not the SPI, which executes work directly within the
> bgworker, making it behave like a normal backend for the purpose of
> query execution.
>
>> Or perhaps I shouldn't do this as a worker but rather modify PostgreSQL
>> itself and do it in a more integrated/destructive manner?
>
> Or just write a front-end.
>
> The problem you'd have attempting to modify PostgreSQL itself for this
> is that connection dispatch occurs via the postmaster, which is a
> single-threaded process that already needs to do a bit of work to keep
> an eye on how things are running. You don't want it constantly busy
> processing and dispatching millions of tiny HTTP requests. It can't just
> hand a connection off to a backend immediately after accepting it,
> either; it'd have to read the HTTP headers to determine which database
> to connect to. Then it would have to launch a new backend for the
> connection, which is horribly inefficient for tiny short-lived
> connections. The postmaster has no concept of a pool of backends to
> re-use (unfortunately, IMO).
>
> I imagine (it's not something I've investigated, really) that you'd want
> a connection accepter process that watched the listening HTTP request
> socket. It'd hand connections off to dispatcher processes that read the
> message content to get the target DB, dispatch the request to a worker
> backend for the appropriate user/db combo, then collect the results and
> return them on the connection. Hopefully at this point you're thinking
> that sounds a lot like a connection pool... because it is.
> An awfully complicated one, probably, as you'd have to manage everything
> using shared memory segments and latches.
>
> In my view it's unwise to try to do this in the DB with PostgreSQL's
> architecture. Hack PgBouncer or PgPool to do what you want. Or write a
> server with Tomcat/Jetty using JAX-RS and PgJDBC and the built-in
> connection pool facilities - you won't *believe* how easy it is.
>
>> 3. Parallelism
>>
>> The regular PostgreSQL server can run many queries in parallel
>
> Well, one PostgreSQL instance (postmaster) may have many backends, each
> of which may run queries in series but not in parallel. Any given
> process may only run one query at once.
>
>> but it seems like if I am using SPI I could only run one query at a
>> time - it's not an asynchronous API.
>
> Correct.
>
>> Any help, sage advice, tips, and suggestions on how to move forward in
>> these areas would be much appreciated!
>
> Don't do it with bgworkers.
>
> --
> Craig Ringer    http://www.2ndQuadrant.com/
> PostgreSQL
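Craig's accepter/dispatcher description is essentially a connection pool keyed by (user, database). A rough sketch of that dispatch logic, in Python rather than C for brevity; the `connect` factory and the `run` callback here are stand-ins for libpq calls, not real PostgreSQL APIs:

```python
# Sketch of the dispatcher Craig describes: requests are routed to a pool
# of backend connections keyed by (user, database), so nothing has to
# launch a new backend per HTTP request.
from collections import defaultdict, deque

class BackendPool:
    def __init__(self, connect, max_per_key=4):
        self._connect = connect          # factory standing in for libpq connect
        self._max = max_per_key
        self._idle = defaultdict(deque)  # (user, db) -> idle connections
        self._count = defaultdict(int)

    def acquire(self, user, db):
        key = (user, db)
        if self._idle[key]:
            return self._idle[key].popleft()
        if self._count[key] < self._max:
            self._count[key] += 1
            return self._connect(user, db)
        raise RuntimeError("pool exhausted for %s/%s" % (user, db))

    def release(self, user, db, conn):
        self._idle[(user, db)].append(conn)

def dispatch(pool, user, db, query, run):
    """Route one request to a pooled backend and return its result."""
    conn = pool.acquire(user, db)
    try:
        return run(conn, query)
    finally:
        pool.release(user, db, conn)
```

This is exactly the shape of work PgBouncer already does, which is Craig's point about not reinventing it inside the server.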
Re: [HACKERS] Tips/advice for implementing integrated RESTful HTTP API
On Mon, Sep 1, 2014 at 7:00 PM, Craig Ringer <cr...@2ndquadrant.com> wrote:
> On 09/02/2014 12:50 AM, Dobes Vandermeer wrote:
>> Hmm yes I am learning that the bgworker system isn't as helpful as I
>> had hoped due to the single database restriction. As for writing a
>> frontend this might be the best solution. A Java frontend would be easy
>> but pointless because the whole point here is to provide a lightweight
>> access method to the database for environments that don't have the
>> ability to use the JDBC or libpq libraries. Deploying a Java setup
>> would be too much trouble.
>
> If you can't run libpq, you can't run *anything* really, it's very
> lightweight.

I think you misunderstood what I was saying; I'm talking about it acting as a proxy for HTTP-based requests, running on or in front of the PostgreSQL server like a server-side connection pool would. I was just referring to an environment that doesn't have a binding to libpq or JDBC; for example, node.js for a long time had no PostgreSQL client, so I didn't use PostgreSQL when I used node.js.

> Same idea as PgBouncer or PgPool. The advantage over hacking
> PgBouncer/PgPool for the job is that Tomcat can already do a lot of what
> you want using built-in, pre-existing functionality. Connection pool
> management, low-level REST-style HTTP processing, JSON handling etc. are
> all done for you.

Yeah, those are nice conveniences, but I still think installing Java and getting something to run on startup is a bit more of a hurdle. Better to make life easier up front by having a simple standalone proxy you can compile and run with just whatever is already available on a typical AWS Ubuntu environment.

>> A C frontend using libevent would be easy enough to make and deploy for
>> this I guess. But... Maybe nobody really wants this thing anyway, there
>> seem to be some other options out there already.
>
> It's something I think would be interesting to have, but IMO to be
> really useful it'd need to support composing object graphs as json, a
> json query format, etc.
> So you can say "get me this customer with all their addresses and
> contact records" without having to issue a bunch of queries (round
> trips) or use ORM-style left-join-and-deduplicate hacks that waste
> bandwidth and are messy and annoying.

If the SQL outputs rows with ARRAY and JSON type columns in them, then that may be sufficient to construct whatever kind of JSON structure you want for the query result. I'm not sure why ORMs don't take better advantage of this; maybe they're just too cross-DB, or maybe this feature isn't as powerful as I think it is? PostgreSQL also allows you to query and index fields inside of a JSON value, so at least initially you can get all this power without inventing any new query language. But later a translator could be made, like an ORM-ish thingy, that might have less clutter than the SQL one because some shorthand could be used for peeking inside the JSON structures.

> Careful attention to security and auth would also be needed. You don't
> want to be sending a username/password with each request; you need a
> reasonable authentication token system, request signing to prevent
> replay attacks, idempotent requests, etc.

Well, these would be needed for use cases where the DB is exposed to untrusted parties, which has never been the case on projects I've worked on. I wouldn't be against these sorts of improvements if people want to make them, but they wouldn't matter much to me. I was hoping to re-use PostgreSQL's built-in password/ident security system.

Cheers, Dobes
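To make the "customer with all their addresses" example concrete: with the left-join approach, the client gets one flat row per address and has to re-nest them itself. A small Python sketch of that client-side deduplication (the column names are hypothetical), which is exactly the work a query returning JSON/ARRAY columns, e.g. via PostgreSQL's json_agg, would make unnecessary:

```python
# Re-nesting flat LEFT JOIN output into per-customer documents, client side.
# Column names (customer_id, name, street) are hypothetical. A query using
# PostgreSQL's row_to_json/json_agg could return the nested form directly.
def nest_customers(rows):
    """rows: (customer_id, customer_name, address_street) tuples from
    customer LEFT JOIN address; returns one document per customer."""
    docs = {}
    for cust_id, name, street in rows:
        doc = docs.setdefault(cust_id, {"id": cust_id, "name": name,
                                        "addresses": []})
        if street is not None:          # LEFT JOIN pads missing addresses
            doc["addresses"].append({"street": street})
    return list(docs.values())
```

Note how the customer's name is duplicated on every row - that's the wasted bandwidth the thread mentions.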
[HACKERS] Tips/advice for implementing integrated RESTful HTTP API
A while back I was working on a little proposal to create a RESTful HTTP front-end for PostgreSQL, and recently I had the inspiration to work on this. So I successfully created a background worker for PostgreSQL 9.3 that can use the SPI to list off the databases in a JSON response. WIP on github: https://github.com/dobesv/restgres/

Now I'm getting into murkier waters and I'm wondering if I can get some helpful tips to guide my R&D here.

1. Connecting to multiple databases

The background workers can apparently only connect to a single database at a time, but I want to expose all the databases via the API. I think I could use libpq to connect to PostgreSQL on localhost, but this might have weird side-effects in terms of authentication, pid use, stuff like that. I could probably manage a pool of dynamic workers (as of 9.4), one per user/database combination or something along those lines. Even one per request? Is there some kind of IPC system in place to help shuttle the requests and responses between dynamic workers? Or do I need to come up with my own? It seems like PostgreSQL itself has a way to shuttle requests out to workers; is it possible to tap into that system instead? Basically some way to send the requests to a PostgreSQL backend from the background worker? Or perhaps I shouldn't do this as a worker but rather modify PostgreSQL itself and do it in a more integrated/destructive manner?

2. Authentication

I was trying to use the function md5_crypt_verify to authenticate the user using their password, and I believe I am providing the right password but it's not being accepted. Any tips on authenticating users in a background worker? Where should I be looking for examples?

3. Parallelism

The regular PostgreSQL server can run many queries in parallel, but it seems like if I am using SPI I could only run one query at a time - it's not an asynchronous API.
This seems related to the multiple-databases issue - either I could use libpq to translate/forward requests onto PostgreSQL's own worker system, or set up my own little worker pool to run the requests in parallel and have a way to send the request/response data to/from those workers.

Any help, sage advice, tips, and suggestions on how to move forward in these areas would be much appreciated!

Regards, Dobes
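The worker-pool idea in (1) and (3) - dispatching request/response pairs to a fixed set of workers over some IPC channel - can be modeled in miniature. This is only a shape sketch: Python threads and queues stand in for dynamic bgworkers and shared-memory message queues, which is not how the real thing would be built:

```python
# Miniature model of the proposed worker pool: requests go onto a shared
# queue, each worker runs them one at a time (as SPI would), and replies
# come back tagged with a request id so callers can be matched up again.
import queue
import threading

def worker(requests, responses, handle):
    while True:
        req_id, payload = requests.get()
        if req_id is None:              # shutdown sentinel
            break
        responses.put((req_id, handle(payload)))

def run_pool(jobs, handle, n_workers=3):
    requests, responses = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker,
                                args=(requests, responses, handle))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for req_id, payload in enumerate(jobs):
        requests.put((req_id, payload))
    results = dict(responses.get() for _ in jobs)   # collect all replies
    for _ in threads:
        requests.put((None, None))                  # stop the workers
    for t in threads:
        t.join()
    return [results[i] for i in range(len(jobs))]
```

The request-id tagging is the important part: because workers finish in arbitrary order, responses must carry enough information to be routed back to the right HTTP connection.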
Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)
On Wed, Apr 4, 2012 at 6:26 AM, Josh Berkus <j...@agliodbs.com> wrote:
>> While I was doing this I always thought this would have been a better
>> approach for my previous project, an accounting application. If I could
>> just have stored entities like invoice or customer as a single document
>> that is inserted, updated, etc. atomically it would be a lot simpler
>> and faster than having to break things out into columns and rows spread
>> over various tables.
>
> Actually, an accounting application is the *worst* candidate for
> document-oriented storage.

I guess I didn't go into enough detail. As it's a small-business bookkeeping system, the records are added after the fact. Constraint checking isn't a priority; we would allow someone to enter sales before purchases and things like that, which means the constraint checking has to be more about flagging issues (we didn't get around to that yet, either). It wasn't an ERP and didn't support inventory, so there's no worry about searching for inventory counts in particular locations. The idea is to input source documents like invoices and receipts and generate reports for stakeholders. I think there is something to be gained by having a first-class concept of a document in the database. It might save some trouble managing parent/child relations, versioning, things like that.

I hand-crafted some materialized views back then, too, since the conversion from a document (like an invoice) to the actual impact of that on account ledgers and balances was non-trivial and evolving as the feature set expanded, so it wasn't something you wanted to try and build into your reporting queries.

> Yes, having documents *in addition* to relational data gives you the
> best of both worlds. You can use relational structures to store data
> which is well-defined and business-critical, and document structures to
> store data which is undefined and not critical.

Well, that's exactly what I was trying to get at in the first place :-).
I'd love to see this kind of functionality in PostgreSQL, and I think materialized views are a pretty powerful way to do that when you are automatically pulling fields out of the document to make the relational tables.

>> So I kind of think the document database kind of bridges the gap
>> between an OODBMS and the RDBMS because the document is like a little
>> cluster of
>
> OODBMS != DocumentDB

Yes, I know. I was just saying that a document DB is a bit more OO because the document itself is stored as an object graph rather than just tuples. Thus it fits in between RDBMS and OODBMS in a way. It makes sense in my head somehow; no need to agree with me on this one.

Regards, Dobes
[HACKERS] Http Frontend implemented using pgsql?
I had a thought that it might be interesting to have a simple C frontend that converts HTTP to and from some pgsql-friendly structure and delegates all the core logic to a stored procedure in the database. This might make it easier to hack on the API without worrying about memory management and buffer overflow vulnerabilities. Is this a brain wave or a brain fart?
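The shim idea above - every HTTP request reduced to a single stored-procedure call - might look roughly like this. Python sketch only; `handle_http` is a hypothetical stored procedure and `execute` stands in for a libpq query call, neither is a real API:

```python
# Thin HTTP shim: the frontend does no application logic of its own. It
# packages (method, path, body) into one call to a hypothetical
# handle_http() stored procedure, which returns (status, content_type, body),
# and then serializes the HTTP response.
REASONS = {200: "OK", 404: "Not Found", 500: "Internal Server Error"}

def shim(execute, method, path, body):
    status, ctype, payload = execute(
        "SELECT * FROM handle_http($1, $2, $3)", (method, path, body))
    return "\r\n".join([
        "HTTP/1.1 %d %s" % (status, REASONS.get(status, "")),
        "Content-Type: %s" % ctype,
        "Content-Length: %d" % len(payload),
        "",
        payload,
    ])
```

Since all routing and business logic would live in SQL/PLpgSQL, the C part stays small enough to audit for the memory-safety issues the message worries about.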
Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)
On Sat, Mar 31, 2012 at 1:44 AM, Daniel Farina <dan...@heroku.com> wrote:
> On Fri, Mar 30, 2012 at 10:21 AM, Daniel Farina <dan...@heroku.com> wrote:
>> Any enhancement here that can't be used with libpq via, say, a drop-in
>> .so seems unworkable to me, and that's why any solution that is
>> basically proxying to the database is basically a non-starter outside
>> the very earliest prototyping stages. The tuple scanning and protocol
>> semantics can and even should remain the same, especially at first.
>
> I should add: proxying could work well if libpq had all the right hooks.
> The server could remain ignorant. Regardless, upstream changes result.

Just to be clear, are you saying that writing a process that accepts requests by HTTP and translates them into requests using the existing protocol to send to the server would have unacceptable performance? Or is there something else about it that is a non-starter?
Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)
On Fri, Mar 30, 2012 at 4:30 PM, Daniel Farina <dan...@heroku.com> wrote:
> On Thu, Mar 29, 2012 at 10:55 PM, Dobes Vandermeer <dob...@gmail.com> wrote:
> Virtual hosts. Same port.
>
> I think SPDY or like-protocols [...] give a crisp treatment to
> interactive, stateful workloads involving back-and-forth between client
> and server with multiplexing, fixing some problems with the hacks in
> HTTP-land from before.

It sounds like at some level you're really talking about replacing the built-in protocol with SPDY, because SPDY is possibly a better baseline than updating the existing protocol. That's an interesting idea; I think this project could evolve in that direction if there's demand for it.

> If only so there is a smaller set of arbitrary decisions to make about
> how to delimit messages... but if SPDY doesn't get widely deployed, or
> exacts an unacceptable performance penalty, it is game over. Well, in
> our case HTTP is a clear win (but not replacement) and SPDY a potential
> one (even as a replacement).

Even if SPDY is not widely adopted, it could still replace FEBE if there's a clear advantage to using it; I don't know enough to make the call right now.
Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)
On Fri, Mar 30, 2012 at 9:57 PM, Merlin Moncure <mmonc...@gmail.com> wrote:
> On Thu, Mar 29, 2012 at 10:04 AM, Andrew Dunstan <and...@dunslane.net> wrote:
>> 1. I've been in discussion with some people about adding simple JSON
>> extract functions. We already have some (i.e. xpath()) for XML.
>
> Your json work is a great start in marrying document-level database
> features with a relational backend. My take is that storing rich data
> inside the database in json format, while tempting, is generally a
> mistake. Unless the document is a black box, it should be decomposed and
> stored relationally and marked back up into a document as it goes out
> the door.

I found storing documents (with no schema) in CouchDB very sexy. For analytics purposes it's very handy to throw whatever data you've got into a document, save it to the database, and sort out the schema later, when you have a need for it. It could be that you want to allow hundreds of different fields in the document, and the document has a hierarchical structure, and so on... and yet there's no interest in spending time figuring out how to map all that effectively into a SQL schema. Some things might start out as a black box but become interesting for reporting purposes later.

While I was doing this I always thought this would have been a better approach for my previous project, an accounting application. If I could just have stored entities like invoice or customer as a single document that is inserted, updated, etc. atomically, it would be a lot simpler and faster than having to break things out into columns and rows spread over various tables. Some fields are just information that isn't searched for or used relationally but ends up getting a field (or a whole bunch of fields, like your address) in the DB schema anyway. Adding, removing, and changing columns is always scary to me in a SQL database - when there's no schema you can fix these things as part of the definition of your view instead.
So I kind of think the document database bridges the gap between an OODBMS and the RDBMS, because the document is like a little cluster of objects, and then you populate your relations by mapping a view of those documents and use that for analysis and search.

> What I'm saying is that jsonpath probably isn't the whole story: another
> way of bulk moving json into native backend structures without parsing
> would also be very helpful. For example, being able to cast a json
> document into a record or a record array would be just amazing.

Hmm, whatever that exactly is, it does sound cool!
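Merlin's "cast a json document into a record" idea can be pictured like this - a Python sketch with plain dictionaries, roughly the behavior PostgreSQL later shipped as json_populate_record: fields named in the record type are extracted and coerced, extras are ignored, and missing fields come back NULL:

```python
# Casting a JSON document into a typed record: pick out the fields named in
# the record type, coerce them to the column types, default missing fields
# to None (NULL), and drop everything else.
import json

def populate_record(record_type, doc_text):
    """record_type: {field_name: coercion fn}; doc_text: a JSON object."""
    doc = json.loads(doc_text)
    return {name: (coerce(doc[name])
                   if name in doc and doc[name] is not None else None)
            for name, coerce in record_type.items()}

# A hypothetical invoice record type: column name -> type coercion.
invoice_type = {"id": int, "total": float, "customer": str}
```

The useful property is the same one Merlin is after: the document's extra, schema-less fields ride along harmlessly until you decide to promote them into the record type.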
[HACKERS] HTTP Frontend? (and a brief thought on materialized views)
Hi guys,

Something from Josh's recent blog post about summer of code clicked with me - the HTTP / SQL concept. It was something I'd been thinking about earlier: how people really like HTTP APIs, and this is one of the drivers behind adoption of some NoSQL databases out there. Some things that I think are great about NoSQL and missing in PostgreSQL are:

1. The reduce function, which could be modeled as a kind of materialized view with an index on it. With materialized views and the ability to pull fields from JSON and XML data structures you could easily do a NoSQL database inside of PostgreSQL by saving the document as a big string and then using materialized views to derive all kinds of data from those documents. Sounds cool on paper, anyway.

2. An HTTP RESTful API. This is obviously not as useful as a good admin tool like pgAdmin or a fast binary protocol like we have now, but it's way more buzzword-compliant and would come in handy sometimes. With CouchDB I was able to allow public data in the database to be accessed directly from the browser using JSONP and/or a simple HTTP proxy in the server, instead of doing any of that work within the app server. So, that saves a step somewhere. With some basic support for stored procedures and serving files directly via this HTTP thing you could potentially eliminate the app server for public, read-only parts of a site/application.

3. The perception of extremely low latency and/or high throughput - like fetching a row in less than 1ms or whatever. I don't know the exact numbers; I just know they have to be insanely low (latency) or impressively high (throughput). "We could use PostgreSQL as a query cache to accelerate your MySQL!" Something like that :-P.

4. Rebelliousness against the man. In our case SQL can't be the man, but we can find something PostgreSQL isn't and position against that.

Anyway, 1 and 2 seem inspiring to me and I'd love to help out with both. 3 and 4 seem more like marketing tasks of some sort...
I think I could probably start hacking up an HTTP API of some sort. I wasn't able to log into the PostgreSQL wiki, so I created a page with some notes here: http://dobesv.com/docs/postgresql-http-api.html

For materialized views, I think that can be a bit of a can of worms, especially if you want to do incremental updates of some sort, because you have to figure out which rows need to be recalculated when you update a row in a source table, and how best to update them. But if someone thinks they know how, I can assist in implementation.

Anyway, I hope I can help AND that I posted this in the correct mailing list!

Cheers, Dobes
Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)
On Thu, Mar 29, 2012 at 11:04 PM, Andrew Dunstan <and...@dunslane.net> wrote:
> On 03/29/2012 10:37 AM, Dobes Vandermeer wrote:
>> Hi guys,
>>
>> Something from Josh's recent blog post about summer of code clicked
>> with me - the HTTP / SQL concept.
>
> 1. I've been in discussion with some people about adding simple JSON
> extract functions. We already have some (i.e. xpath()) for XML.
>
> 2. You might find htsql <http://htsql.org/> interesting.

As a reference, or should we just bundle / integrate that with PostgreSQL somehow?
Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)
On Fri, Mar 30, 2012 at 3:59 AM, Daniel Farina <dan...@heroku.com> wrote:
> On Thu, Mar 29, 2012 at 12:57 PM, Daniel Farina <dan...@heroku.com> wrote:
> More technical concerns:
>
> * Protocol compression -- but a bit of sand in the gears is *which*
> compression -- for database workloads, the performance of zlib can be a
> meaningful bottleneck.

I think if performance is the issue, people should use the native protocol. This HTTP thing should be more of a RAD / prototyping thing, I think, so people can be in their comfort zone when talking to the server.

> * Something similar to the HTTP Host header, so that one can route to
> databases without having to conflate database identity with the actual
> port being connected to. Yes, theoretically it can be done with weird
> startup packet gyrations, but that is firmly in the weird category.

Isn't the URL good enough (/databases/dbname), or are you talking about having some sort of virtual host setup where you have multiple sets of databases available on the same port?

> Socialish (but no less important):
>
> * A standard frame extension format. For example, last I checked
> Postgres-XC, it required snapshot information to be passed, and it'd be
> nice if instead of having to hack the protocol they could attach an
> X-Snapshot-Info header, or whatever. This could also be a nice way to
> pass out-of-band hint information of all sorts.

I am sorry to admit I don't understand the terms "frame extension format" or "Postgres-XC" in this paragraph... help?

> * HTTP -- and *probably* its hypothetical progeny -- are more common
> than FEBE packets, and a lot of the incidental complexity of writing
> routers is reduced by the commonality of routing HTTP traffic.

This is an interesting comment. I'm not sure how to test whether an HTTP-based protocol will be better supported than a proprietary one, but I think we have enough other reasons that we can move forward. Well, we have the reason that there's some kind of love affair with HTTP-based protocols going on out there...
We might as well ride the wave while it's still rising (I hope).

As for SPDY, I can see how it may be helpful, but as it is basically a different way to send HTTP requests (from what I understand), the migration to SPDY is mainly a matter of adding support for it to whatever HTTP library is used.

Anyone have a thought on whether the HTTP server itself should be integrated right into the PostgreSQL server? Or would it be better to have a separate process that proxies requests to PostgreSQL using the existing protocol? Is there an API that can be used in both cases semi-transparently (i.e. the functions have the same name when linked right in, or when calling via a socket)?

Cheers, Dobes
Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)
On Fri, Mar 30, 2012 at 3:57 AM, Daniel Farina <dan...@heroku.com> wrote:
> On Thu, Mar 29, 2012 at 8:04 AM, Andrew Dunstan <and...@dunslane.net> wrote:
> Lastly, a case that can not as easily be fixed without some more
> thinking is leveraging the caching semantics of HTTP. I think people
> would really, really like that, if they could get away from having to
> hand-roll their own cache regeneration in common cases.

I think this could be an interesting possibility. I wonder if the cost of a round-trip makes the cost of sending the actual data (vs. a 304 response) irrelevant - as long as PostgreSQL is caching effectively internally, it's possible it can send back the actual content as fast as it can calculate the ETag for it, so doing an extra query to check for changes could possibly slow things down, or at least eliminate the benefit. Probably worth trying, though.

Supporting this automagically would require some kind of generic algorithm for calculating the Last-Modified time or ETag for a given query. As a default we may be able to fall back on some internal global value that is guaranteed to change if the database has changed (I think the WAL files have some kind of serial number system we might use), so at the very least you could send back a 304 Not Modified if literally nothing in the database has changed. Narrowing that down to specific table timestamps might be possible, too, for simple queries. It depends what data is already available; I wouldn't want to add any extra bookkeeping for it.

A more pragmatic approach may be to have the HTTP request include SQL code to generate an ETag or Last-Modified value to test against; the app could run that first and it would be used for caching. Something like calcLastModified=max(modified_date) on the query string.
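The conditional-response flow sketched in that last paragraph would look something like this (Python sketch; `etag_query` stands in for the user-supplied calcLastModified SQL, and hashing its result into an ETag is an assumption of this sketch, not anything PostgreSQL provides):

```python
# Conditional GET against a query result: run a cheap "version" query first
# (e.g. max(modified_date)), derive an ETag from it, and only run the real
# query when the client's If-None-Match doesn't match.
import hashlib

def conditional_get(run_sql, etag_query, data_query, if_none_match=None):
    version = str(run_sql(etag_query))
    etag = '"%s"' % hashlib.md5(version.encode()).hexdigest()
    if if_none_match == etag:
        return 304, etag, None            # client's cached copy is fresh
    return 200, etag, run_sql(data_query)
```

This makes the trade-off in the message visible: every request now costs the version query, so the scheme only pays off when the data query (or the bandwidth to ship its result) is meaningfully more expensive.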
Re: [HACKERS] HTTP Frontend? (and a brief thought on materialized views)
On Fri, Mar 30, 2012 at 10:55 AM, Daniel Farina <dan...@heroku.com> wrote:
> On Thu, Mar 29, 2012 at 6:37 PM, Dobes Vandermeer <dob...@gmail.com> wrote:
>> On Fri, Mar 30, 2012 at 3:59 AM, Daniel Farina <dan...@heroku.com> wrote:
>>> On Thu, Mar 29, 2012 at 12:57 PM, Daniel Farina <dan...@heroku.com> wrote:
>>> More technical concerns:
>>>
>>> * Protocol compression -- but a bit of sand in the gears is *which*
>>> compression -- for database workloads, the performance of zlib can be
>>> a meaningful bottleneck.
>>
>> I think if performance is the issue, people should use the native
>> protocol.
>
> No. I do not think so. I think a reasonable solution (part of what MS is
> actually proposing to the IETF) is to make compression optional, or have
> clients support an LZ77-family format like Snappy. I get the impression
> that SPDY is waffling a little on this: it mandates compression, and
> definitely zlib, but is less heavy-handed about pushing, say, Snappy.
> Although I can understand why a Google-originated technology probably
> doesn't want to push another Google-originated implementation so hard,
> it would have been convenient for me for Snappy to have become a more
> common format.
>
>> Isn't the URL good enough (/databases/dbname), or are you talking about
>> having some sort of virtual host setup where you have multiple sets of
>> databases available on the same port?
>
> Virtual hosts. Same port.

In that case the frontend would not be tied to a specific PostgreSQL server, then? I think initially this might complicate things a bit, and you could solve it by putting an HTTP proxy in front to do the virtual hosts for you.

>>> * A standard frame extension format. For example, last I checked
>>> Postgres-XC, it required snapshot information to be passed, and it'd
>>> be nice if instead of having to hack the protocol they could attach an
>>> X-Snapshot-Info header, or whatever. This could also be a nice way to
>>> pass out-of-band hint information of all sorts.
>> I am sorry to admit I don't understand the terms "frame extension
>> format" or "Postgres-XC" in this paragraph ... help?
>
> It'd be nice if it wasn't necessary to do that and they had a much
> easier path to multiplex additional information into the connection.

Ah, I get it - you want a way to add some extra information to the protocol in a backwards-compatible way. HTTP (and SPDY) provides a standard way to do that. Makes sense.

> For my own purposes, I'm intensely interested in lacing the connection
> with:
>
> * EXPLAIN ANALYZE
> * Partition IDs
> * Read-only vs. write workload

I'll make a note of these and hash out the details a bit more once there's something working to add them to.

>> As for SPDY I can see how it may be helpful but as it is basically a
>> different way to send HTTP requests
>
> I think SPDY or like-protocols [...] give a crisp treatment to
> interactive, stateful workloads involving back-and-forth between client
> and server with multiplexing, fixing some problems with the hacks in
> HTTP-land from before.

It sounds like at some level you're really talking about replacing the built-in protocol with SPDY, because SPDY is possibly a better baseline than updating the existing protocol. That's an interesting idea; I think this project could evolve in that direction if there's demand for it.
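The two routing schemes debated in this thread - a URL path prefix like /databases/dbname versus a Host-header style virtual host - can be contrasted in a tiny sketch (hypothetical helper names; a real frontend would do this in C while parsing the request):

```python
# Resolving the target database from an HTTP request, the two ways
# discussed in the thread: from the URL path (/databases/<name>/...) or
# from a Host-header virtual-host mapping (hostname -> database).
def db_from_path(path):
    parts = path.strip("/").split("/")
    if len(parts) >= 2 and parts[0] == "databases":
        return parts[1]
    return None

def db_from_host(host, vhost_map):
    # strip an optional :port, then look up the virtual host
    return vhost_map.get(host.split(":")[0])
```

The path form needs no configuration, which is why the thread suggests starting there and leaving Host-based virtual hosting to an HTTP proxy placed in front.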