Re: [HACKERS] additional json functionality

2013-11-21 Thread Maciej Gajewski
Hi everyone

I used to work on a project storing large quantities of schema-less data,
initially using MongoDB, then Postgres with JSON, and eventually I
implemented BSON support for Postgres to get the best of both worlds:
https://github.com/maciekgajewski/postgresbson

I don't think that JSONB is a good idea. There is a lot to learn from
MongoDB's mistakes in this area.

1. As noted in this thread previously, JSON is a serialization format, not
a document format.

2. Almost any structured data type, self-describing or not, can be
serialized to/from JSON, but always using only a subset of it, and
interpreting it in its own specific way.

3. JSON's greatest strength is interoperability. It is a great feature of
Postgres that JSON is stored as text; it's basically 'text, but you can
do something with it'. There are many JSON implementations out there, and
one should make no assumptions about an application's expectations.
For instance: the JSON standard (RFC 4627) defines all numbers to be doubles.
Yet I've seen applications storing 64-bit integers (which wouldn't fit in
double precision), or even arbitrary-precision integers. Most parsers are OK
with that.
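To illustrate (a quick sketch, not from the original mail): 2^53 + 1 is
the smallest positive integer that a double cannot represent, which is
easy to check in SQL:

SELECT 9007199254740993::float8::bigint;
-- returns 9007199254740992: the low bit is silently lost

So a 64-bit id routed through a strictly double-based JSON parser
silently loses precision.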

4. JSON's greatest weakness is performance. Because of 1., it needs to be
parsed before any useful information can be extracted.

5. Points 3 and 4 are mutually exclusive; this is one of the most valuable
takeaways I have from working with Mongo, and with BSON in particular. BSON
is an attempt to create a 'binary JSON', and a failed one. It is a poor
serialization format: faster than JSON, but less flexible. Being binary, it
is strongly typed, and it uses various gimmicks to preserve flexibility:
implicit type casts, three different equality-comparison functions, etc. And
it's not fully convertible to/from JSON; no binary format is.
It is a poor document format as well: it retains some of JSON's
performance problems, namely the serial nature and inefficient storage.

6. Speed matters to some, and being able to generate binary data in the
application and send it to the database without any serialization/parsing in
between provides a great optimization opportunity. One thing the Mongo guys
got right is the fixed, well-defined binary representation: an application
can use the provided library to generate objects, and doesn't need to worry
about the server's version or endianness.

In the application I've mentioned before, switching from JSON to BSON (in
Postgres 9.2, using postgresbson) increased throughput by an order of
magnitude. It was an insert-heavy database with indexes on object fields.
Both serializing in the application and deserializing in the server were
~10x faster.

7. It seems to me that JSONB is going to repeat all the mistakes of BSON;
it's going to be 'neither'. If there is an agreement that Postgres needs a
'document' format, why not acknowledge 5. and simply adopt one of the
existing formats? Or even better: adopt none, provide many, provide binary
send/recv and conversion to and from JSON, and let the user choose.

The world is full of self-describing binary formats: BSON, MessagePack
(http://msgpack.org/), protobuf, and the hierarchical H-Store that is coming
along. Adding another one would create confusion, and a situation similar to
this: http://xkcd.com/927/


And a side note:

Postgres' greatest and most under-advertised feature is its extensibility.
People tend to notice only the features present in the core package, while
there should be a huge banner on top of http://www.postgresql.org/: Kids,
we support all data types: we have XML, we have JSON, we have H-store, we
have BSON, and all of it with built-in indexing, storage compression and
full transaction support!



Maciej G.


Re: [HACKERS] Review: query result history in psql

2013-07-02 Thread Maciej Gajewski
The query history is stored within the client, so once the user stops
the client, it is gone. But yes, it would be useful to have some tool
that would allow you to see what's in there.

It could be a command (\showans ?) that would list all :ansXXX
variables, together with the query text and the size of the answer.
It would probably look ugly for very long queries, but could be useful
anyway.

I'm not sure if I'll be able to implement this feature any time soon,
as I'm very busy at the moment and going on a business trip in a few
days.

In the meantime, I've applied your suggestions and moved the
string-manipulating functions to stringutils. I also fixed a tiny bug.
Patch attached.

M




Re: [HACKERS] Review: query result history in psql

2013-07-01 Thread Maciej Gajewski
I'm not really sold on some of the ideas.


but maybe some interactive mode would be useful - so after
 execution, and after showing the result, there would be a prompt asking
 whether the result should be saved or not.

 I like the idea, in addition to the ordinary mode. Personally, I would use
 the ordinary mode, but I can see how 'interactive' would be useful.


This would require a complex change to the client code. And the result
would eventually become annoying: an interactive question after each and
every query. Currently, when the feature is turned on, every result is
stored and a simple notification is printed.



  yes, the names :ans01, :ans02, ... lack semantics - how can I connect
 this name (and content) with some SQL query?

 That makes sense. I think having part of / the whole query string would be
 very helpful. Great suggestion!



The naming is obscure and non-informative, I agree. If you have a nice idea
of how to make it better, I'd love to discuss it. But please remember that
it has one huge advantage: simplicity. The client is a classic command-line
tool, and as such it delegates some of its functionality to external
programs, like the pager or the terminal.

I'm pretty sure that your terminal emulator has a 'find' function that
would allow you to quickly locate the variable and associated query in the
scrollback.

M


Re: [HACKERS] Review: query result history in psql

2013-07-01 Thread Maciej Gajewski
 When I tested this feature, I had 30 caches per 5 minutes, and only a
 few of these queries were of any use. Switching between off and on is not
 user-friendly. I believe there can be solutions other than mine, but a
 convenient way to clean unwanted caches is necessary.


If you know beforehand that you'll need the result of a query, you can
always use SELECT ... INTO ... . No client-side features required.
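For example (a sketch, reusing the 'quotes' table from my patch
proposal mail):

SELECT date, close
INTO TEMPORARY tmp_quotes
FROM quotes
WHERE date < '2013-05-01';

-- available server-side for the rest of the session:
SELECT * FROM tmp_quotes ORDER BY date;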

This feature is intended for people running plenty of ad-hoc queries, when
every result could potentially be useful.


Re: [HACKERS] Review: query result history in psql

2013-06-28 Thread Maciej Gajewski
Thanks for checking the patch!

So what's left to fix?
* Moving the escaping-related functions to a separate module,
* applying your corrections.

Did I miss anything?

I'll submit a corrected patch after the weekend.

M


Re: [HACKERS] Review: query result history in psql

2013-06-27 Thread Maciej Gajewski
Thank you for the review!


There were a few English/grammatical mistakes that I went ahead and fixed.


Thank you for that. Could you send me a patch-to-a-patch, so I can
correct all the mistakes in the next release?


 Additionally, I think some of the string manipulation might be placed
 outside of the main ans.c file. I don't know if there's a better place for
 'EscapeForCopy' and 'GetEscapedLen'. Not really a big deal, just an
 organizational idea. I also changed 'EscapeForCopy' to 'EscapeAndCopy'. I
 think that better describes the functionality. 'EscapeForCopy' kind of
 implies that another function is needed to copy the string.



The 'EscapeForCopy' was meant to mean 'escape the string in the format
required by the COPY TEXT format', so the 'copy' in the name refers to the
escaping format, not to an action performed by the function.

They could indeed be placed in a separate module. I'll do it.



 What does 'ans' stand for? I am not sure how it relates to the concept of
 a query history. It didn't stop my understanding of the code, but I don't
 know if a user will immediately know the meaning.


Some mathematical toolkits, like Matlab or Mathematica, automatically set a
variable called 'ans' (short for 'answer') containing the result of the
last operation. I was trying to emulate exactly this behaviour.



 Probably the biggest problem is that the query history list is missing a
 maximum size variable. I think this could be valuable for preventing users
 from shooting themselves in the foot. If the user is running large queries,
 they might accidentally store too much data. This probably somewhat of an
 edge-case but I believe it is worth considering. We could provide a
 sensible default limit (10 queries?) and also allow the user to change it.


I was considering such a behaviour. But since the feature is turned off by
default, I decided that whoever is using it is aware of the cost. Instead of
truncating the history automatically (which could lead to a nasty
surprise), I decided to equip the user with \ansclean, a command erasing
the history. I believe it is better to let the user decide when the history
should be erased, instead of doing it automatically.


Finally, is it worth resetting the query history every time a user
 reconnects to the database? I can see how this might interrupt a user's
 workflow. If the user suddenly disconnects (network connection interrupted,
 etc) then they would lose their history. I think this is definitely up for
 debate. It would add more management overhead (psql options etc) and might
 just be unnecessary. However, with a sane limit to the size of the query
 history, I don't know if there would be too many drawbacks from a storage
 perspective.


The history is not erased; it is always stored in the client's memory.
When a history item is used for the first time, a TEMPORARY table
is created in the database that stores the data server-side. When the user
disconnects from the database, the session ends and all these tables are
dropped.
The table names then have to be removed from the history, so that the next
time an item is used, its table will be created and populated again.

I use the feature while switching often between databases, and it works
seamlessly. It's actually quite useful for moving bits of data across
databases: connect to database A, run a query, connect to database B, run
another query joining local data with the results of the previous one.
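Something like this (a sketch; the database and table names are made up):

\c database_a
SELECT id, payload FROM events WHERE ts > now() - interval '1 day';
-- Query result stored as :ans0
\c database_b
SELECT l.*, a.payload
FROM local_events l
JOIN :ans0 a ON a.id = l.id;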


 Those issues aside - I think it's a great feature! I can add the
 grammatical fixes I made whenever the final patch is ready. Or earlier,
 whatever works for you. Also, this is my first time reviewing a patch, so
 please let me know if I can improve on anything. Thanks!


This is my first submitted patch, so I can't really comment on the
process. But if you could add the author's email to CC, the message would
be much easier to spot. I replied only after two days because I missed the
message in the flood of other pgsql-hackers messages. I think I need to
scan the list more carefully...

Maciej


Re: [HACKERS] [9.4 CF 1] The Commitfest Slacker List

2013-06-24 Thread Maciej Gajewski
Maybe this policy should be mentioned on the wiki, so that newbies like
myself (who wouldn't even dare review patches submitted by seasoned
hackers) are not surprised to see their own name on a wall of shame?

M


Re: [HACKERS] C++ compiler

2013-06-16 Thread Maciej Gajewski
It would be great. I'm working at the moment on porting integer operations
to unsigned types, and the code is essentially a small number of functions
repeated for every combination of integer types.
In C++ it could simply be one single set of template functions. Less code,
fewer bugs.
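A rough sketch of what I mean (hypothetical names, ignoring the fmgr
plumbing around the actual operator functions):

#include <limits>
#include <stdexcept>

// One template replaces a hand-written overflow-checked addition
// for each of int2/int4/int8/uint2/uint4/uint8.
template <typename T>
T add_with_overflow_check(T a, T b)
{
    if (b > 0 ? a > std::numeric_limits<T>::max() - b
              : a < std::numeric_limits<T>::min() - b)
        throw std::overflow_error("integer out of range");
    return a + b;
}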


Re: [HACKERS] Unsigned integer types

2013-05-29 Thread Maciej Gajewski
I will implement it as an extension, then.

My feeling is that PostgreSQL extensions tend to fall into obscurity.
As an ordinary user it took me a really long time to find out that
interesting features are available in the form of extensions; they are
certainly under-marketed. But this is a topic for a separate discussion.

 You have not at all addressed the real problem with doing what you are asking 
 for, the one that Tom Lane stated:


 Basically, there is zero chance this will happen unless you can find
 a way of fitting them into the numeric promotion hierarchy that doesn't
 break a lot of existing applications.  We have looked at this more than
 once, if memory serves, and failed to come up with a workable design
 that didn't seem to violate the POLA.


I'm sorry, I thought my proposal was clear.

I propose not to integrate the unsigned types into the existing promotion
hierarchy, and to behave just like gcc would with -Werror: require an
explicit cast. Among themselves, the unsigned types would be automatically
converted upwards (uint2 < uint4 < uint8).
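In other words, something like this (hypothetical syntax; uint4 and
uint8 don't exist today):

SELECT 1::uint4 + 1::uint8;        -- OK: uint4 is promoted to uint8
SELECT 1::uint4 + 1::int4;         -- ERROR: explicit cast required
SELECT 1::uint4::int8 + 1::int4;   -- OK: the cast is explicit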


Maciek




Re: [HACKERS] Unsigned integer types

2013-05-28 Thread Maciej Gajewski
The reasons are: performance, storage and frustration.

I think the frustration comes from the fact that unsigned integers are
universally available, except in PostgreSQL. I work with a really
complex system, with many moving parts, and Postgres really is one of
the components that causes the least trouble (compared to other
opens-source and closed-sourced systems, which I shall leave unnamed),
except for the unsigned integers.

Let me give you a few examples:

1. SMALLINT

Probably the most popular unsigned short int on the planet: the IP port
number. I had to store some network traffic data in the DB, and I
instinctively started prototyping it like this:

CREATE TABLE packets (addr INET, port SMALLINT, ... );

Of course it failed quickly and I had to bump the size to INTEGER. No
real harm here, as the 2 bytes will probably go into some padding
anyway, but somehow it feels wrong.
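For the record, this is what greets you today, since smallint is a
signed 16-bit type:

SELECT 40000::smallint;
ERROR:  smallint out of range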

2. INTEGER

I had to store a record with several uint32 fields. I had to store an awful
lot of them: hundreds of GB of data per day. Roughly half of the record
consists of uint32 fields.
Increasing the data type to bigint would mean that I could store 3 days'
worth of data on the available storage instead of 4.
Continuing with int4 meant that I would have to deal with the data in a
special way when it enters and leaves the DB. It's easy in C: just
cast uint32_t to int32_t. But the Python code requires more complex
changes. And the web backend too...
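The kind of special-casing I mean, e.g. when reading a stored int4 back
as the original uint32 (a sketch; the table and column names are made up):

SELECT CASE WHEN field >= 0 THEN field::bigint
            ELSE field::bigint + 4294967296
       END AS uint32_value
FROM packet_stats;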

It's suffering either way!

Just imagine the conversation I had to have with my boss: either
we increase the budget for storage, or we need to touch every bit of
the system.

3. BIGINT

There is no escape from bigint. Numeric (or TEXT!) is the only thing
that can hold a uint64, but when you have 10^9 and more records, and you
need to do some arithmetic on them, numeric is just too slow.

We use uint64 all across our system as a unique event identifier. It
works fine, it's fast, and it's very convenient: passing uint64s
around, storing them, looking them up. We use them everywhere, including
the UI and log files. So once I decided to use BIGINT to store it, I had
to guard all the inputs and outputs and make sure it is handled
correctly. Or so I thought.

It turned out that some guys from a different department were parsing
some logs with a Perl parser and storing them in the DB. They chose to
store the uint64 id as TEXT. They probably tried BIGINT, failed, and
decided - since they have low volume and are not doing any
arithmetic - to store it as TEXT.

And now someone has come up with the idea of joining one table with the
other: bigint with text. I did it. Initially I wrote a function that
converted the text to numeric, then rotated it around 2^64 if necessary.
It was too slow. Too slow for something that should be a simple
reinterpretation of data.

Eventually I ended up writing a C function that first scanf("%llu")'d the
text into a uint64_t, and then PG_RETURN_INT64-ed the uint64_t value. It
works fast, but operations hate me for increasing the complexity of DB
deployment.
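For the curious, a minimal sketch of such a function (the names and the
error handling are illustrative, not the original code):

#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(uint64_text_to_int64);

Datum
uint64_text_to_int64(PG_FUNCTION_ARGS)
{
    char               *str = text_to_cstring(PG_GETARG_TEXT_PP(0));
    unsigned long long  val;

    if (sscanf(str, "%llu", &val) != 1)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                 errmsg("invalid uint64 text: \"%s\"", str)));

    /* Reinterpret the bits: values above 2^63-1 wrap to negative
     * int64, which is exactly how the BIGINT columns store them. */
    PG_RETURN_INT64((int64) val);
}

Plus the usual CREATE FUNCTION ... RETURNS bigint ... LANGUAGE C STRICT
declaration on the SQL side.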

---

I know some cynical people who love this kind of problem; they feel
that the constant struggle is what keeps them employed :) But I'm
ready to use my private time to solve it once and for all.

I'm afraid that implementing uints as an extension would introduce
some performance penalty (I may be wrong). I'm also afraid that with
an extension I'd be left maintaining it on my own forever, while if
it could go into the core product, it would live forever.

As for the POLA violation: programmers experienced with statically
typed languages shouldn't have problems dealing with all the issues
surrounding signed/unsigned integers (like the ones described here:
http://c-faq.com/expr/preservingrules.html). Others don't need to use
them.

Maciek


On 27 May 2013 16:16, Tom Lane t...@sss.pgh.pa.us wrote:
 Maciej Gajewski maciej.gajews...@gmail.com writes:
 The lack of unsigned integer types is one of the biggest sources of
 grief in my daily work with pgsql.
 Before I go and start hacking, I'd like to discuss a few points:
 1. Is there a strong objection against merging this kind of patch?

 Basically, there is zero chance this will happen unless you can find
 a way of fitting them into the numeric promotion hierarchy that doesn't
 break a lot of existing applications.  We have looked at this more than
 once, if memory serves, and failed to come up with a workable design
 that didn't seem to violate the POLA.

 2. How/if should the behaviour of numeric literals change?

 The minimalistic solution is: it shouldn't; literals should be assumed
 signed by default. A more complex solution could involve using a C-style
 suffix ('123456u').

 Well, if you don't do that, there is no need for you to merge anything:
 you can build unsigned types as an external extension if they aren't
 affecting the core parser's behavior.  As long as it's external, you
 don't need to satisfy anybody else's idea of what reasonable behavior
 is ...

 regards, tom lane



[HACKERS] Unsigned integer types

2013-05-27 Thread Maciej Gajewski
Hi all

I know this topic was discussed before, but there doesn't seem to be
any conclusion.

The lack of unsigned integer types is one of the biggest sources of
grief in my daily work with pgsql.

Before I go and start hacking, I'd like to discuss a few points:

1. Is there a strong objection against merging this kind of patch?

I can provide numerous reasons why using a bigger int or a numeric type
just doesn't cut it.

2. How/if should the behaviour of numeric literals change?

The minimalistic solution is: it shouldn't; literals should be assumed
signed by default. A more complex solution could involve using a C-style
suffix ('123456u').

3. How/if should comparing signed and unsigned types work?

IMO it shouldn't be allowed, and an explicit cast should be required.

Thanks in advance!

Maciek




Re: [HACKERS] Patch proposal: query result history in psql

2013-05-26 Thread Maciej Gajewski
Polished version of the patch.

* The feature is disabled by default, and enabled by the backslash command
\ans. Additionally, \ansclean cleans the result history.
* Escaping is applied when building the COPY IN string.

This patch is a diff between master:230e92c and
https://github.com/maciekgajewski/psql-ans.git:2997f9c

Maciek



On 16 May 2013 19:18, David E. Wheeler da...@justatheory.com wrote:
 On May 16, 2013, at 7:02 AM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:

 I find this feature quite useful, but I understand that my use case
 may be quite unique.

 Just to say that I too find what you've done quite useful. Please add
 your patch to the next commit fest for consideration in 9.4!

 FYI, you can add it here:

   https://commitfest.postgresql.org/action/commitfest_view?id=18

 Best,

 David





[HACKERS] Patch proposal: query result history in psql

2013-05-14 Thread Maciej Gajewski
The attached patch contains a feature I've implemented for myself, to make
working with huge datasets easier.

I work with large datasets (1E8 - 1E9 records), and the nature of my
work is such that I must dig things out of the data on an ad-hoc
basis. I spend a lot of time in psql.

Sometimes a query runs for a few minutes. And when the result finally
arrives, sometimes it's too big, contains too many columns or is
ordered incorrectly. Quite often I was thinking to myself: if only I
could run a query on the result, instead of having to re-run the
original query and wait a few more minutes...

Eventually I just wrote the feature. I use it every day now and I'm
really happy with it.

*How it works*

After a query result arrives, a message is displayed: Query result
stored as :ansN, where N is an ordinal number.

The user can then use the pseudo-variable :ansN as a relation name in
subsequent queries.

Under the hood, all the query results are stored locally, and when an
:ansN variable is used, a temporary table is created and populated with
the data. The variable then expands to the table's name.
Sample session:

===

anstest=# select * from quotes limit 10;
date| open  | high  |  low  | close |  volume  | adjclose
+---+---+---+---+--+--
 2013-05-03 | 22.57 | 22.85 | 22.55 | 22.57 | 45523300 |22.57
 2013-05-02 | 22.25 | 22.32 | 22.15 | 22.32 | 27651500 |22.32
 2013-05-01 | 22.10 | 22.35 | 22.10 | 22.15 | 39201600 |22.15
 2013-04-30 | 22.29 | 22.38 | 22.21 | 22.29 | 34054800 |22.29
 2013-04-29 | 22.31 | 22.32 | 22.00 | 22.27 | 36531800 |22.27
 2013-04-26 | 21.98 | 22.40 | 21.97 | 22.21 | 47012500 |22.21
 2013-04-25 | 22.21 | 22.23 | 21.91 | 21.95 | 41462900 |21.95
 2013-04-24 | 21.69 | 22.03 | 21.65 | 21.96 | 51496600 |21.96
 2013-04-23 | 21.55 | 21.69 | 21.36 | 21.50 | 65489600 |21.50
 2013-04-22 | 21.67 | 21.68 | 21.11 | 21.35 | 87787900 |21.35
(10 rows)

Query result stored as :ans0
anstest=# select date,close from :ans0 where date < '2013-05-01';
date| close
+---
 2013-04-30 | 22.29
 2013-04-29 | 22.27
 2013-04-26 | 22.21
 2013-04-25 | 21.95
 2013-04-24 | 21.96
 2013-04-23 | 21.50
 2013-04-22 | 21.35
(7 rows)

Query result stored as :ans1
anstest=# select * from :ans1 order by date;
date| close
+---
 2013-04-22 | 21.35
 2013-04-23 | 21.50
 2013-04-24 | 21.96
 2013-04-25 | 21.95
 2013-04-26 | 22.21
 2013-04-29 | 22.27
 2013-04-30 | 22.29
(7 rows)

Query result stored as :ans2

===

I find this feature quite useful, but I understand that my use case
may be quite unique.

If the maintainers think that this is something that could be useful to
the general public, I'm ready to polish any rough edges of the attached
patch to make it suitable for inclusion.

Because the feature introduces some overhead, it should probably be
turned off by default and turned on by a backslash command and/or a
command-line parameter.

Maciek

