Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-16 Thread Theodore Ts'o
On Wed, Jan 15, 2014 at 10:35:44AM +0100, Jan Kara wrote:
> Filesystems could in theory provide facility like atomic write (at least up
> to a certain size say in MB range) but it's not so easy and when there are
> no strong usecases fs people are reluctant to make their code more complex
> unnecessarily. OTOH without widespread atomic write support I understand
> application developers have similar stance. So it's kind of chicken and egg
> problem. BTW, e.g. ext3/4 has quite a bit of the infrastructure in place
> due to its data=journal mode so if someone on the PostgreSQL side wanted to
> research on this, knitting some experimental ext4 patches should be doable.

For the record, a researcher (plus his PhD student) at HP Labs actually
implemented a prototype based on ext3 which created an atomic write
facility.  It was good up to about 25% of the ext4 journal size (so, a
couple of MB), and it was used to research using persistent memory by
creating a persistent heap using standard in-memory data structures as
a replacement for using a database.

The result of their research work showed that ext3 plus atomic write
plus standard Java associative arrays beat using SQLite.

It was a research prototype, so they didn't handle OOM kill
conditions, and they also didn't try benchmarking against a real
database instead of a toy database such as SQLite, but if someone
wants to experiment with atomic write, there are patches against ext3
that we can probably get from HP Labs.
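
For context on what such a facility would replace: without atomic
write support, applications usually fall back on the write-to-temp,
fsync, rename pattern to get all-or-nothing replacement of a whole
file. The sketch below is only an illustration of that userspace
workaround, not the HP Labs prototype or any ext3/ext4 interface. The
function name `atomic_replace` and the `.tmp` suffix are hypothetical,
and it assumes a POSIX system where rename within one file system is
atomic.

```python
import os


def atomic_replace(path: str, data: bytes) -> None:
    """Replace `path` so readers see the old or the new contents, never a mix."""
    tmp = path + ".tmp"  # hypothetical temp-file naming; must be on the same file system
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Force the new contents to stable storage before exposing them.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Atomically swap the new file into place.
    os.rename(tmp, path)

    # fsync the containing directory so the rename itself is durable.
    dirfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```

Note that this rewrites the entire file and costs two fsync calls per
update. It does not give an in-place, multi-block atomic update of the
kind discussed above.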

  - Ted

Sent via pgsql-hackers mailing list (
To make changes to your subscription:

Re: [Lsf-pc] [HACKERS] Linux kernel impact on PostgreSQL performance

2014-01-13 Thread Theodore Ts'o
The issue with O_DIRECT is actually a much more general issue ---
namely, database programmers who for various reasons decide they
don't want to go down the O_DIRECT route, but then care about
performance.  PostgreSQL is not the only database which has had this
issue.

There are two papers at this year's FAST conference about the "Journal
of Journal" (JoJ) problem, which has been triggered by the use of SQLite on
Android handsets, and its write patterns, some of which some folks
(including myself) have characterized as "abusive".  (As in, when the
database developer says to the kernel developer, "Doctor, doctor, it
hurts when I do that...")

The problem statement for JoJ was introduced at last year's Usenix ATC
conference, in "I/O Stack Optimizations for Smartphones"[1].


The high order bit is what's the right thing to do when database
programmers come to kernel engineers saying, we want to do  and
the performance sucks.  Do we say, "Use O_DIRECT, dummy",
notwithstanding Linus's past comments on the issue?  Or do we have some
general design principles that we tell database engineers that they
should follow for better performance, and then all developers for all of
the file systems can then try to optimize for a set of new APIs, or
recommended ways of using the existing APIs?

Surely the wrong answer is that we do things which encourage people to
create entire new specialized file systems for different databases.
The f2fs file system was essentially created because someone thought
it was easier to create a new file system from scratch instead of
trying to change how SQLite or some other existing file system works.
Hopefully we won't have companies using MySQL and PostgreSQL deciding
they need their own mysqlfs and postgresqlfs!  :-)


- Ted
