One more problem is that data placement on HDFS is implicit, meaning you have no explicit control over it. Thus, you cannot place two sets of data that are likely to be joined together on the same node, which means uncontrollable latency during query processing.

Pi Song
On Mon, Feb 23, 2009 at 7:47 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Sat, Feb 21, 2009 at 9:37 PM, pi song <pi.so...@gmail.com> wrote:
> > 1) Hadoop file system is very optimized for mostly read operation
> > 2) As of a few months ago, hdfs doesn't support file appending.
> > There might be a bit of impedance to make them go together.
> > However, I think it should be a very good initiative to come up with
> > ideas to be able to run postgres on distributed file system (doesn't
> > have to be specific to hadoop).
>
> In theory, I think you could make postgres work on any type of
> underlying storage you like by writing a second smgr implementation
> that would exist alongside md.c. The fly in the ointment is that
> you'd need a more sophisticated implementation of this line of code,
> from smgropen:
>
>     reln->smgr_which = 0; /* we only have md.c at present */
>
> Logically, it seems like the choice of smgr should track with the
> notion of a tablespace. IOW, you might want to have one tablespace that
> is stored on a magnetic disk (md.c) and another that is stored on your
> hypothetical distributed filesystem (hypodfs.c). I'm not sure how
> hard this would be to implement, but I don't think smgropen() is in a
> position to do syscache lookups, so probably not that easy.
>
> ...Robert