This is the age-old argument of what to share in a partitioned environment. IBM and Teradata have always used "shared nothing," which is effectively what you get when each Hadoop node sees only one chunk of the file. Oracle has always used "shared disk," which is not an easy thing to do, especially at scale, and seems to give varying results depending on the application (transactional or DSS). Here are a couple of web references:
http://www.informatik.uni-trier.de/~ley/db/conf/vldb/Bhide88.html
http://jhingran.typepad.com/anant_jhingrans_musings/2010/02/shared-nothing-vs-shared-disks-the-cloud-sequel.html

Rather than say shared nothing isn't useful, Hadoop should look at how others make this work. The two key problems to avoid are data skew, where one node sees too much data and becomes the slow node, and large intra-partition joins, where large amounts of data are needed from more than one partition and potentially get copied around. Rather than hybridizing toward shared disk, I think Hadoop should hybridize toward the shared-data solution others use in a "shared nothing" architecture: replicating selected data to resolve intra-partition joins (a rough sketch is appended below the quoted message). This may be more database terminology, and something better addressed by HBase, but I think it is good background for the question of memory-mapping files in Hadoop.

Kevin

-----Original Message-----
From: Ted Dunning [mailto:tdunn...@maprtech.com]
Sent: Tuesday, April 12, 2011 12:09 AM
To: Jason Rutherglen
Cc: common-user@hadoop.apache.org; Edward Capriolo
Subject: Re: Memory mapped resources

Yes. But only one such block. That is what I meant by chunk.

That is fine if you want that chunk, but if you want to mmap the entire file, it isn't really useful.

On Mon, Apr 11, 2011 at 6:48 PM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
> What do you mean by local chunk? I think it's providing access to the
> underlying file block?
>
> On Mon, Apr 11, 2011 at 6:30 PM, Ted Dunning <tdunn...@maprtech.com> wrote:
> > Also, it only provides access to a local chunk of a file, which isn't very
> > useful.
> >
> > On Mon, Apr 11, 2011 at 5:32 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> >>
> >> On Mon, Apr 11, 2011 at 7:05 PM, Jason Rutherglen
> >> <jason.rutherg...@gmail.com> wrote:
> >> > Yes, you can; however, it will require customization of HDFS. Take a
> >> > look at HDFS-347, specifically the HDFS-347-branch-20-append.txt patch.
> >> > I have been altering it for use with HBASE-3529. Note that the patch
> >> > noted is for the -append branch, which is mainly for HBase.
> >> >
> >> > On Mon, Apr 11, 2011 at 3:57 PM, Benson Margulies
> >> > <bimargul...@gmail.com> wrote:
> >> >> We have some very large files that we access via memory mapping in
> >> >> Java. Someone's asked us about how to make this conveniently
> >> >> deployable in Hadoop. If we tell them to put the files into HDFS, can
> >> >> we obtain a File for the underlying file on any given node?
> >> >
> >>
> >> This feature is not yet part of Hadoop, so doing this is not
> >> "convenient".
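
To make the "replicating selected data" idea above concrete, here is a minimal sketch of a replicated (map-side) join in MapReduce using the 0.20-era DistributedCache API: the small table is shipped to every node and held in memory, so each node joins its own local chunk of the big table with no repartitioning and no shuffle of the big table. The class name, the tab-delimited record layout, and the cache-file setup are assumptions for illustration, not working production code.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ReplicatedJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

  // In-memory copy of the small (replicated) table, keyed by join key.
  private final Map<String, String> smallTable = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException {
    // The small table was registered on the client with
    // DistributedCache.addCacheFile(...) and is now on the local disk of every node.
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
    String line;
    while ((line = in.readLine()) != null) {
      String[] parts = line.split("\t", 2);   // assumed layout: key <TAB> value
      smallTable.put(parts[0], parts[1]);
    }
    in.close();
  }

  @Override
  protected void map(LongWritable offset, Text bigRow, Context context)
      throws IOException, InterruptedException {
    String[] parts = bigRow.toString().split("\t", 2);
    String match = smallTable.get(parts[0]);
    if (match != null) {
      // The join happens entirely on the local node against the replicated copy.
      context.write(new Text(parts[0]), new Text(parts[1] + "\t" + match));
    }
  }
}

On the client side the small file would be registered with something like DistributedCache.addCacheFile(new URI("/tables/small.tsv"), job.getConfiguration()) before submitting the job; that path is, again, just a placeholder.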
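
And since the quoted thread is about what mmap actually buys you: once something along the lines of the HDFS-347 patch hands back a path to the block file on the local filesystem, the mapping itself is plain java.nio. This is only a sketch and the path below is a placeholder; as Ted points out, at best you get one block-sized local chunk this way, not the whole file.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapLocalChunk {
  public static void main(String[] args) throws Exception {
    // Placeholder path: in practice this would be the local block file
    // exposed by an HDFS-347-style short-circuit read.
    RandomAccessFile raf = new RandomAccessFile("/path/to/local/block/file", "r");
    FileChannel ch = raf.getChannel();
    // Map the whole local file (at most one HDFS block) read-only.
    MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
    byte first = buf.get(0);   // random access with no read() syscalls
    System.out.println("first byte: " + first);
    ch.close();
    raf.close();
  }
}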