Re: [HACKERS] Hadoop backend?
Paul Sheer wrote:
> Hadoop backend for PostgreSQL

Resurrecting an old thread: it seems some guys at Yale implemented something very similar to what this thread was discussing:

http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html

It's an open-source stack that includes PostgreSQL, Hadoop, and Hive, along with some glue between PostgreSQL and Hadoop, a catalog, a data loader, and an interface that accepts queries in MapReduce or SQL and generates query plans that are processed partly in Hadoop and partly in different PostgreSQL instances spread across many nodes in a shared-nothing cluster of machines. Their detailed paper is here:

http://db.cs.yale.edu/hadoopdb/hadoopdb.pdf

According to the paper, it scales very well.

> A problem that my client has, and one that I come across often, is that a database seems to always be associated with a particular physical machine, a physical machine that has to be upgraded, replaced, or otherwise maintained. Even if the database is replicated, it just means there are two or more machines. Replication is also a difficult thing to properly manage.
>
> With a distributed data store, the data would become a logical object - no adding or removal of machines would affect the data. This is an ideal that would remove a tremendous maintenance burden from many sites; well, at least the ones I have worked at, as far as I can see.
>
> Does anyone know of plans to implement PostgreSQL over Hadoop? Yahoo seems to be doing this:
> http://glinden.blogspot.com/2008/05/yahoo-builds-two-petabyte-postgresql.html
> But they store tables column-wise for their performance situation. If one is doing a lot of inserts, I don't think this is the most efficient approach, is it?
>
> Has Yahoo put the source code for their work online?
>
> Many thanks for any pointers.
>
> -paul
Re: [HACKERS] Hadoop backend?
> As far as I can tell, the PG storage manager API is at the wrong level of abstraction for pretty much everything. These days, everything we do is atop the Unix filesystem API, and anything that smgr might have been

Is there a complete list of filesystem API calls somewhere that I can get my head around? At least this will give me an idea of the scope of the effort.

> should be easier to build postgres data feeder in Hadoop (which might even already exist).

Do you know of a URL?

-paul
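For scope: the storage manager interface is the f_smgr function-pointer table in src/backend/storage/smgr/smgr.c, and that file is the authoritative list. A from-memory paraphrase (parameter lists abbreviated to comments, names possibly inexact) looks roughly like this, with everything expressed in terms of relations, forks, and fixed-size blocks:

    /* From-memory paraphrase of PostgreSQL's storage manager interface; the
     * authoritative version is the f_smgr struct in
     * src/backend/storage/smgr/smgr.c.  Signatures abbreviated and possibly
     * inexact. */
    #include <stdbool.h>

    typedef unsigned int BlockNumber;       /* really uint32 in PostgreSQL */

    typedef struct f_smgr_sketch
    {
        void        (*smgr_init) ();        /* startup / shutdown hooks       */
        void        (*smgr_shutdown) ();
        void        (*smgr_create) ();      /* (reln, fork, isRedo)           */
        bool        (*smgr_exists) ();      /* (reln, fork)                   */
        void        (*smgr_unlink) ();      /* (rnode, fork, isRedo)          */
        void        (*smgr_extend) ();      /* (reln, fork, blocknum, buffer) */
        void        (*smgr_read) ();        /* (reln, fork, blocknum, buffer) */
        void        (*smgr_write) ();       /* (reln, fork, blocknum, buffer) */
        BlockNumber (*smgr_nblocks) ();     /* (reln, fork)                   */
        void        (*smgr_truncate) ();    /* (reln, fork, nblocks)          */
        void        (*smgr_immedsync) ();   /* (reln, fork)                   */
        void        (*smgr_pre_ckpt) ();    /* checkpoint hooks               */
        void        (*smgr_sync) ();
        void        (*smgr_post_ckpt) ();
    } f_smgr_sketch;

md.c implements these through the fd.c virtual file descriptor layer with ordinary open/lseek/read/write/ftruncate/unlink/fsync calls, which is essentially the Unix filesystem API referred to in the quote above.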
Re: [HACKERS] Hadoop backend?
Why not just stream it in via set-returning functions, and make sure that we can mark a set-returning function as STREAMABLE or so (to prevent joins, whatever)? It is the easiest way to get it right, and it helps in many other cases. I think that the storage manager is definitely the wrong place to do this. It is also easy to use more than just one backend then, if you get the interface code right.

regards,
hans

On Feb 24, 2009, at 12:03 AM, Jonah H. Harris wrote:

> On Sun, Feb 22, 2009 at 3:47 PM, Robert Haas robertmh...@gmail.com wrote:
>> In theory, I think you could make postgres work on any type of underlying storage you like by writing a second smgr implementation that would exist alongside md.c. The fly in the ointment is that you'd need a more sophisticated implementation of this line of code, from smgropen:
>>
>>     reln->smgr_which = 0;    /* we only have md.c at present */
>
> I believe there is more than that which would need to be done nowadays. I seem to recall that the storage manager abstraction has slowly been dedicated/optimized for md over the past 6 years or so. It may even be easier/preferred to write a Hadoop-specific access method, depending on what you're looking for from Hadoop.
>
> --
> Jonah H. Harris, Senior DBA
> myYearbook.com

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: www.postgresql-support.de
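As a concrete illustration of the streaming set-returning-function idea above, here is a minimal sketch using PostgreSQL's standard value-per-call SRF interface from funcapi.h. The Hadoop-facing calls (hadoop_open_scan, hadoop_fetch_next) are hypothetical placeholders, and the proposed STREAMABLE marker does not exist; as written, the executor is still free to materialize the result set:

    /* Minimal sketch of a set-returning function that could stream rows from
     * an external store.  The external-scan calls named in comments are
     * hypothetical; only the funcapi.h machinery is real. */
    #include "postgres.h"
    #include "fmgr.h"
    #include "funcapi.h"

    PG_MODULE_MAGIC;

    PG_FUNCTION_INFO_V1(hadoop_stream);

    Datum
    hadoop_stream(PG_FUNCTION_ARGS)
    {
        FuncCallContext *funcctx;

        if (SRF_IS_FIRSTCALL())
        {
            MemoryContext oldcontext;

            funcctx = SRF_FIRSTCALL_INIT();
            oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);

            /* A real implementation would open the external scan here and
             * stash its handle, e.g. hadoop_open_scan(...) -- hypothetical. */
            funcctx->user_fctx = NULL;

            MemoryContextSwitchTo(oldcontext);
        }

        funcctx = SRF_PERCALL_SETUP();

        {
            /* A real implementation would fetch the next row here, e.g.
             * hadoop_fetch_next(funcctx->user_fctx) -- hypothetical.
             * NULL means "no more rows". */
            text *row = NULL;

            if (row != NULL)
                SRF_RETURN_NEXT(funcctx, PointerGetDatum(row));
            else
                SRF_RETURN_DONE(funcctx);
        }
    }

Such a function would be exposed with something like CREATE FUNCTION hadoop_stream() RETURNS SETOF text AS 'MODULE_PATHNAME', 'hadoop_stream' LANGUAGE C; the open question in this subthread is how to tell the planner that the result cannot be cheaply rescanned or joined.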
Re: [HACKERS] Hadoop backend?
Tom Lane wrote:
> It's interesting to speculate about where we could draw an abstraction boundary that would be more useful. I don't think the MySQL guys got it right either...

The supposed smgr abstraction of PostgreSQL, which tells more or less how to get a byte to the disk, is quite far away from what MySQL calls a storage engine, which has things like open table, scan table, drop table on a logical level (meaning open the table, not open the heap). To my judgement, neither of these approaches is terribly useful from a current, practical point of view.

In any case, in order to settle the "where to abstract" question, you'd probably want to have one or two other storage APIs that you seriously want to integrate, and then you can analyze how to unify them.
Re: [HACKERS] Hadoop backend?
> With a distributed data store, the data would become a logical object - no adding or removal of machines would affect the data. This is an ideal that would remove a tremendous maintenance burden from many sites; well, at least the ones I have worked at, as far as I can see.

Two things:

1) Hadoop is the wrong technology. It's not designed to support transactional operations.

2) Transactional operations are, in general, your Big Obstacle for doing anything in the way of a distributed storage manager.

It's possible you could make both of the above go away if you were planning for a DW (data warehouse) platform in which transactions weren't important. However, that would have to become an incompatible fork of PostgreSQL.

AFAIK, the Yahoo platform does not involve Hadoop at all.

--Josh
Re: [HACKERS] Hadoop backend?
> It would only be possible to have the actual PostgreSQL backends running on a single node anyway, because they use shared memory to

This is not a problem: performance is a secondary consideration (at least as far as the problem I was referring to). The primary usefulness is to have the data be a logical entity rather than a physical entity, so that one can maintain physical machines without having to worry too much about where-is-the-data.

At the moment, most databases suffer from the problem of occasionally having to move the data from one place to another. This is a major nightmare that happens once every few years for most DBAs. It happens because a system needs a soft/hard upgrade, or a disk enlarged, or because a piece of hardware fails.

I have also found it's no use having RAID or ZFS. Each of these ties the data to an OS installation. If the OS needs to be reinstalled, all the data has to be manually moved in a way that is, well... dangerous.

If there is only one machine running postgres, that is fine too: I can have a second identical machine on standby in case of a hardware failure. That means a short amount of downtime - most people can live with that.

I read somewhere that replication is one of the goals of postgres's coming development efforts. Personally I think Hadoop might be a better solution - *shrug*.

Thoughts/comments?

-paul
Re: [HACKERS] Hadoop backend?
On Mon, Feb 23, 2009 at 9:08 AM, Paul Sheer paulsh...@gmail.com wrote:
>> It would only be possible to have the actual PostgreSQL backends running on a single node anyway, because they use shared memory to
>
> This is not a problem: performance is a secondary consideration (at least as far as the problem I was referring to). The primary usefulness is to have the data be a logical entity rather than a physical entity, so that one can maintain physical machines without having to worry too much about where-is-the-data.
>
> At the moment, most databases suffer from the problem of occasionally having to move the data from one place to another. This is a major nightmare that happens once every few years for most DBAs. It happens because a system needs a soft/hard upgrade, or a disk enlarged, or because a piece of hardware fails.
>
> I have also found it's no use having RAID or ZFS. Each of these ties the data to an OS installation. If the OS needs to be reinstalled, all the data has to be manually moved in a way that is, well... dangerous.
>
> If there is only one machine running postgres, that is fine too: I can have a second identical machine on standby in case of a hardware failure. That means a short amount of downtime - most people can live with that.

I think the performance aspect of things has to be considered. If the system introduces too much overhead it won't be useful in practice. But apart from that I agree with all of this.

> I read somewhere that replication is one of the goals of postgres's coming development efforts. Personally I think Hadoop might be a better solution - *shrug*. Thoughts/comments?

I think the two are solving different problems.

...Robert
Re: [HACKERS] Hadoop backend?
Paul Sheer wrote:
> I have also found it's no use having RAID or ZFS. Each of these ties the data to an OS installation. If the OS needs to be reinstalled, all the data has to be manually moved in a way that is, well... dangerous.

How about network storage, fiber-attached? If you move the db, you only need to redirect the LUN(s) to a new WWN.

--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/
Re: [HACKERS] Hadoop backend?
Hi,

Paul Sheer wrote:
> This is not a problem: performance is a secondary consideration (at least as far as the problem I was referring to).

Well, if you don't mind your database running... ehm... creeping several orders of magnitude slower, you might also be interested in Single-System Image clustering systems [1], like Beowulf, Kerrighed [2], OpenSSI [3], etc. Besides distributed filesystems, those also provide transparent shared memory across nodes.

> The primary usefulness is to have the data be a logical entity rather than a physical entity, so that one can maintain physical machines without having to worry too much about where-is-the-data.

There are lots of solutions offering that already. In what way should Hadoop be better than any of those existing ones? For a slightly different example, you can get equivalent functionality at the block-device layer with DRBD [4], which is in successful use for Postgres as well. The main challenge with distributed filesystems remains reliable failure detection and ensuring that exactly one node is alive at any time.

> At the moment, most databases suffer from the problem of occasionally having to move the data from one place to another. This is a major nightmare that happens once every few years for most DBAs. It happens because a system needs a soft/hard upgrade, or a disk enlarged, or because a piece of hardware fails.

You are comparing to standalone nodes here, which doesn't make much sense, IMO.

> I have also found it's no use having RAID or ZFS. Each of these ties the data to an OS installation. If the OS needs to be reinstalled, all the data has to be manually moved in a way that is, well... dangerous.

I'm thinking more of Lustre, GFS, OCFS, AFS or some such. Compare with those!

> If there is only one machine running postgres, that is fine too: I can have a second identical machine on standby in case of a hardware failure. That means a short amount of downtime - most people can live with that.

What most people have trouble with is a master that revives and suddenly confuses the new master (the old slave).

> I read somewhere that replication is one of the goals of postgres's coming development efforts. Personally I think Hadoop might be a better solution - *shrug*.

I'm not convinced at all. The trouble is not the replication of the data on disk; it's rather the data in shared memory which poses the hard problems (locks, caches, etc.). The former is solved already; the latter is a tad harder to solve. See [5] for my approach (showing my bias).

What I do find interesting about Hadoop is the MapReduce approach, but lots more than writing another storage backend is required if you want to make use of that for Postgres. Greenplum claims to have implemented MapReduce for their database [6]; however, to me it looks like that works a couple of layers above the filesystem.

Regards

Markus Wanner

[1]: Wikipedia: Single-System Image Clustering, http://en.wikipedia.org/wiki/Single-system_image
[2]: http://www.kerrighed.org/
[3]: http://www.openssi.org/
[4]: http://www.drbd.org/
[5]: Postgres-R: http://www.postgres-r.org/
[6]: Greenplum MapReduce: http://www.greenplum.com/resources/mapreduce/
Re: [HACKERS] Hadoop backend?
On Sun, Feb 22, 2009 at 3:47 PM, Robert Haas robertmh...@gmail.com wrote:
> In theory, I think you could make postgres work on any type of underlying storage you like by writing a second smgr implementation that would exist alongside md.c. The fly in the ointment is that you'd need a more sophisticated implementation of this line of code, from smgropen:
>
>     reln->smgr_which = 0;    /* we only have md.c at present */

I believe there is more than that which would need to be done nowadays. I seem to recall that the storage manager abstraction has slowly been dedicated/optimized for md over the past 6 years or so. It may even be easier/preferred to write a Hadoop-specific access method, depending on what you're looking for from Hadoop.

--
Jonah H. Harris, Senior DBA
myYearbook.com
Re: [HACKERS] Hadoop backend?
Jonah H. Harris jonah.har...@gmail.com writes:
> I believe there is more than that which would need to be done nowadays. I seem to recall that the storage manager abstraction has slowly been dedicated/optimized for md over the past 6 years or so.

As far as I can tell, the PG storage manager API is at the wrong level of abstraction for pretty much everything. These days, everything we do is atop the Unix filesystem API, and anything that smgr might have been able to do for us is getting handled in kernel filesystem code or device drivers. (Back in the eighties, when it was more plausible for PG to do direct device access, maybe smgr was good for something; but no more.)

It's interesting to speculate about where we could draw an abstraction boundary that would be more useful. I don't think the MySQL guys got it right either...

			regards, tom lane
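To illustrate the point with a toy (plain C, not PostgreSQL source; the real mdread() goes through the fd.c virtual file descriptor layer, but the effect is the same): reading one 8 kB block through md.c boils down to a seek plus a read on an ordinary file, with the kernel doing the actual work:

    /* Toy illustration: an md.c-style block read is essentially lseek+read
     * on an ordinary file.  The segment file name below is invented. */
    #include <sys/types.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <stdio.h>

    #define TOY_BLCKSZ 8192

    static ssize_t
    read_block(int fd, unsigned blkno, char *buf)
    {
        if (lseek(fd, (off_t) blkno * TOY_BLCKSZ, SEEK_SET) < 0)
            return -1;
        return read(fd, buf, TOY_BLCKSZ);   /* the kernel does the rest */
    }

    int
    main(void)
    {
        char buf[TOY_BLCKSZ];
        int  fd = open("/tmp/relation_segment_toy", O_RDONLY);

        if (fd >= 0)
        {
            printf("read %zd bytes of block 0\n", read_block(fd, 0, buf));
            close(fd);
        }
        return 0;
    }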
Re: [HACKERS] Hadoop backend?
> I believe there is more than that which would need to be done nowadays. I seem to recall that the storage manager abstraction has slowly been dedicated/optimized for md over the past 6 years or so. It may even be easier/preferred to write a Hadoop-specific access method, depending on what you're looking for from Hadoop.

I think you're very right. What Postgres needs is an access method abstraction. One should be able to plug in an access method for SSDs or network filesystems where appropriate.

I won't talk about the MapReduce part of Hadoop because I think that's a different story. What you need for MapReduce is 1) a data store which feeds you data, and then 2) MapReduce does the query processing. This has nothing in common with the Postgres query processor. If you just want data from Postgres, then it should be easier to build a Postgres data feeder in Hadoop (which might even already exist).

Pi Song

On Tue, Feb 24, 2009 at 11:24 AM, Tom Lane t...@sss.pgh.pa.us wrote:
> Jonah H. Harris jonah.har...@gmail.com writes:
>> I believe there is more than that which would need to be done nowadays. I seem to recall that the storage manager abstraction has slowly been dedicated/optimized for md over the past 6 years or so.
>
> As far as I can tell, the PG storage manager API is at the wrong level of abstraction for pretty much everything. These days, everything we do is atop the Unix filesystem API, and anything that smgr might have been able to do for us is getting handled in kernel filesystem code or device drivers. (Back in the eighties, when it was more plausible for PG to do direct device access, maybe smgr was good for something; but no more.)
>
> It's interesting to speculate about where we could draw an abstraction boundary that would be more useful. I don't think the MySQL guys got it right either...
>
> 			regards, tom lane
Re: [HACKERS] Hadoop backend?
Hi...

I think the easiest way to do this is to simply add a mechanism to functions which allows a function to stream data through. It would basically mean losing join support, as you cannot read the data again in a way which is good enough for joining with the function providing the data from Hadoop. Hannu (I think) brought up a similar concept some time ago. I think a straightforward implementation would not be too hard.

best regards,
hans

On Feb 22, 2009, at 3:37 AM, pi song wrote:

> 1) The Hadoop file system is very optimized for mostly-read operation.
> 2) As of a few months ago, HDFS doesn't support file appending.
>
> There might be a bit of impedance to make them go together. However, I think it should be a very good initiative to come up with ideas to be able to run Postgres on a distributed filesystem (doesn't have to be Hadoop specifically).
>
> Pi Song
>
> On Sun, Feb 22, 2009 at 7:17 AM, Paul Sheer paulsh...@gmail.com wrote:
>> Hadoop backend for PostgreSQL
>>
>> A problem that my client has, and one that I come across often, is that a database seems to always be associated with a particular physical machine, a physical machine that has to be upgraded, replaced, or otherwise maintained. Even if the database is replicated, it just means there are two or more machines. Replication is also a difficult thing to properly manage.
>>
>> With a distributed data store, the data would become a logical object - no adding or removal of machines would affect the data. This is an ideal that would remove a tremendous maintenance burden from many sites; well, at least the ones I have worked at, as far as I can see.
>>
>> Does anyone know of plans to implement PostgreSQL over Hadoop? Yahoo seems to be doing this:
>> http://glinden.blogspot.com/2008/05/yahoo-builds-two-petabyte-postgresql.html
>> But they store tables column-wise for their performance situation. If one is doing a lot of inserts, I don't think this is the most efficient approach, is it?
>>
>> Has Yahoo put the source code for their work online?
>>
>> Many thanks for any pointers.
>>
>> -paul

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: www.postgresql-support.de
Re: [HACKERS] Hadoop backend?
On Sat, Feb 21, 2009 at 9:37 PM, pi song pi.so...@gmail.com wrote:
> 1) The Hadoop file system is very optimized for mostly-read operation.
> 2) As of a few months ago, HDFS doesn't support file appending.
> There might be a bit of impedance to make them go together. However, I think it should be a very good initiative to come up with ideas to be able to run Postgres on a distributed filesystem (doesn't have to be Hadoop specifically).

In theory, I think you could make postgres work on any type of underlying storage you like by writing a second smgr implementation that would exist alongside md.c. The fly in the ointment is that you'd need a more sophisticated implementation of this line of code, from smgropen:

    reln->smgr_which = 0;    /* we only have md.c at present */

Logically, it seems like the choice of smgr should track with the notion of a tablespace. IOW, you might want to have one tablespace that is stored on a magnetic disk (md.c) and another that is stored on your hypothetical distributed filesystem (hypodfs.c). I'm not sure how hard this would be to implement, but I don't think smgropen() is in a position to do syscache lookups, so probably not that easy.

...Robert
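To make the "more sophisticated implementation" concrete, here is a small self-contained toy, not PostgreSQL source: every name below, including tablespace OID 42 and "hypodfs", is invented. It shows the general shape of an smgrsw[]-style dispatch table selected per tablespace instead of hard-coding entry 0 (md.c):

    /* Toy per-tablespace storage manager dispatch.  Illustrative only. */
    #include <stdio.h>

    typedef struct
    {
        const char *name;
        void      (*read_block) (unsigned blkno, char *buf);
    } f_smgr_toy;

    static void
    md_read(unsigned blkno, char *buf)
    {
        (void) buf;
        printf("md: read block %u from local magnetic disk\n", blkno);
    }

    static void
    dfs_read(unsigned blkno, char *buf)
    {
        (void) buf;
        printf("hypodfs: read block %u from a distributed filesystem\n", blkno);
    }

    static const f_smgr_toy smgrsw_toy[] = {
        {"md", md_read},            /* entry 0: local disk, as today          */
        {"hypodfs", dfs_read},      /* entry 1: hypothetical distributed FS   */
    };

    /* Pretend that tablespace OID 42 lives on the distributed filesystem. */
    static int
    smgr_for_tablespace(unsigned spcOid)
    {
        return (spcOid == 42) ? 1 : 0;
    }

    int
    main(void)
    {
        char buf[8192];

        smgrsw_toy[smgr_for_tablespace(1663)].read_block(7, buf);  /* default tablespace */
        smgrsw_toy[smgr_for_tablespace(42)].read_block(7, buf);    /* "DFS" tablespace   */
        return 0;
    }

The hard part Robert points out remains: mapping a tablespace OID to a dispatch-table index inside smgropen(), which is not in a position to do syscache lookups.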
Re: [HACKERS] Hadoop backend?
One more problem is that data placement on HDFS is handled internally, meaning you have no explicit control over it. Thus, you cannot place two sets of data which are likely to be joined together on the same node, which means uncontrollable latency during query processing.

Pi Song

On Mon, Feb 23, 2009 at 7:47 AM, Robert Haas robertmh...@gmail.com wrote:
> On Sat, Feb 21, 2009 at 9:37 PM, pi song pi.so...@gmail.com wrote:
>> 1) The Hadoop file system is very optimized for mostly-read operation.
>> 2) As of a few months ago, HDFS doesn't support file appending.
>> There might be a bit of impedance to make them go together. However, I think it should be a very good initiative to come up with ideas to be able to run Postgres on a distributed filesystem (doesn't have to be Hadoop specifically).
>
> In theory, I think you could make postgres work on any type of underlying storage you like by writing a second smgr implementation that would exist alongside md.c. The fly in the ointment is that you'd need a more sophisticated implementation of this line of code, from smgropen:
>
>     reln->smgr_which = 0;    /* we only have md.c at present */
>
> Logically, it seems like the choice of smgr should track with the notion of a tablespace. IOW, you might want to have one tablespace that is stored on a magnetic disk (md.c) and another that is stored on your hypothetical distributed filesystem (hypodfs.c). I'm not sure how hard this would be to implement, but I don't think smgropen() is in a position to do syscache lookups, so probably not that easy.
>
> ...Robert
Re: [HACKERS] Hadoop backend?
On Sun, Feb 22, 2009 at 5:18 PM, pi song pi.so...@gmail.com wrote:
> One more problem is that data placement on HDFS is handled internally, meaning you have no explicit control over it. Thus, you cannot place two sets of data which are likely to be joined together on the same node, which means uncontrollable latency during query processing.
>
> Pi Song

It would only be possible to have the actual PostgreSQL backends running on a single node anyway, because they use shared memory to hold lock tables and things. The advantage of a distributed filesystem would be that you could access more storage (and more system buffer cache) than would be possible on a single system (or perhaps the same amount but at less cost). Assuming some sort of per-tablespace control over the storage manager, you could put your most frequently accessed data locally and the less frequently accessed data into the DFS. But you'd still have to pull all the data back to the master node to do anything with it.

Being able to actually distribute the computation would be a much harder problem. Currently, we don't even have the ability to bring multiple CPUs to bear on (for example) a large sequential scan (even though all the data is on a single node).

...Robert
Re: [HACKERS] Hadoop backend?
On Mon, Feb 23, 2009 at 3:56 PM, pi song pi.so...@gmail.com wrote:

I think the point that you can access more system cache is right, but that doesn't mean it will be more efficient than accessing your local disk. Take Hadoop, for example: your request for file content first has to go to the Namenode (the file-chunk indexing service), and only then do you ask the data node, which provides the data. Assuming that you're working on a large dataset, the probability of the data chunk you need staying in system cache is very low, so most of the time you end up reading from a remote disk.

I've got a better idea. How about we make the buffer pool multilevel? The first level is the current one. The second level represents memory from remote machines. Things that are used less often should stay on the second level. Has anyone ever thought about something like this before?

Pi Song

On Mon, Feb 23, 2009 at 1:09 PM, Robert Haas robertmh...@gmail.com wrote:
> On Sun, Feb 22, 2009 at 5:18 PM, pi song pi.so...@gmail.com wrote:
>> One more problem is that data placement on HDFS is handled internally, meaning you have no explicit control over it. Thus, you cannot place two sets of data which are likely to be joined together on the same node, which means uncontrollable latency during query processing.
>>
>> Pi Song
>
> It would only be possible to have the actual PostgreSQL backends running on a single node anyway, because they use shared memory to hold lock tables and things. The advantage of a distributed filesystem would be that you could access more storage (and more system buffer cache) than would be possible on a single system (or perhaps the same amount but at less cost). Assuming some sort of per-tablespace control over the storage manager, you could put your most frequently accessed data locally and the less frequently accessed data into the DFS. But you'd still have to pull all the data back to the master node to do anything with it.
>
> Being able to actually distribute the computation would be a much harder problem. Currently, we don't even have the ability to bring multiple CPUs to bear on (for example) a large sequential scan (even though all the data is on a single node).
>
> ...Robert
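For what it's worth, the lookup order implied by that multilevel idea would be something like the following self-contained toy. It is purely illustrative: the two cache levels are stubs that always miss, and nothing like this exists in PostgreSQL today.

    /* Toy two-level buffer lookup: local pool, then remote-memory pool,
     * then (possibly remote) storage.  All names invented. */
    #include <stdio.h>
    #include <string.h>

    #define TOY_BLCKSZ 8192

    /* stub lookups: always miss */
    static int  l1_lookup(unsigned blkno, char *buf) { (void) blkno; (void) buf; return 0; }
    static int  l2_lookup(unsigned blkno, char *buf) { (void) blkno; (void) buf; return 0; }

    static void
    storage_read(unsigned blkno, char *buf)
    {
        memset(buf, 0, TOY_BLCKSZ);
        printf("fetched block %u from (possibly remote) storage\n", blkno);
    }

    static void
    buffer_read(unsigned blkno, char *buf)
    {
        if (l1_lookup(blkno, buf))      /* hot: local buffer pool              */
            return;
        if (l2_lookup(blkno, buf))      /* warm: memory on remote machines     */
            return;
        storage_read(blkno, buf);       /* cold: local disk or distributed FS  */
        /* a real implementation would now decide which level should cache it */
    }

    int
    main(void)
    {
        char buf[TOY_BLCKSZ];

        buffer_read(123, buf);
        return 0;
    }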
[HACKERS] Hadoop backend?
Hadoop backend for PostgreSQL

A problem that my client has, and one that I come across often, is that a database seems to always be associated with a particular physical machine, a physical machine that has to be upgraded, replaced, or otherwise maintained. Even if the database is replicated, it just means there are two or more machines. Replication is also a difficult thing to properly manage.

With a distributed data store, the data would become a logical object - no adding or removal of machines would affect the data. This is an ideal that would remove a tremendous maintenance burden from many sites; well, at least the ones I have worked at, as far as I can see.

Does anyone know of plans to implement PostgreSQL over Hadoop? Yahoo seems to be doing this:

http://glinden.blogspot.com/2008/05/yahoo-builds-two-petabyte-postgresql.html

But they store tables column-wise for their performance situation. If one is doing a lot of inserts, I don't think this is the most efficient approach, is it?

Has Yahoo put the source code for their work online?

Many thanks for any pointers.

-paul
Re: [HACKERS] Hadoop backend?
1) The Hadoop file system is very optimized for mostly-read operation.
2) As of a few months ago, HDFS doesn't support file appending.

There might be a bit of impedance to make them go together. However, I think it should be a very good initiative to come up with ideas to be able to run Postgres on a distributed filesystem (doesn't have to be Hadoop specifically).

Pi Song

On Sun, Feb 22, 2009 at 7:17 AM, Paul Sheer paulsh...@gmail.com wrote:
> Hadoop backend for PostgreSQL
>
> A problem that my client has, and one that I come across often, is that a database seems to always be associated with a particular physical machine, a physical machine that has to be upgraded, replaced, or otherwise maintained. Even if the database is replicated, it just means there are two or more machines. Replication is also a difficult thing to properly manage.
>
> With a distributed data store, the data would become a logical object - no adding or removal of machines would affect the data. This is an ideal that would remove a tremendous maintenance burden from many sites; well, at least the ones I have worked at, as far as I can see.
>
> Does anyone know of plans to implement PostgreSQL over Hadoop? Yahoo seems to be doing this:
> http://glinden.blogspot.com/2008/05/yahoo-builds-two-petabyte-postgresql.html
> But they store tables column-wise for their performance situation. If one is doing a lot of inserts, I don't think this is the most efficient approach, is it?
>
> Has Yahoo put the source code for their work online?
>
> Many thanks for any pointers.
>
> -paul