I may have expressed myself poorly. You don't need to run any tests to see how locality works for files with multiple blocks. If you access a file of more than one block over WebHDFS, locality is only guaranteed for the first block of the file.
Thanks.

On Sun, Mar 16, 2014 at 9:18 PM, RJ Nowling <rnowl...@gmail.com> wrote:

> Thank you, Mingjiang and Alejandro.
>
> This is interesting. Since we will use the data locality information for
> scheduling, we could "hack" this to get the data locality information, at
> least for the first block. As Alejandro says, we'd have to test what
> happens for other data blocks -- e.g., what if, knowing the block sizes, we
> request the second or third block?
>
> Interesting food for thought! I see some experiments in my future!
>
> Thanks!
>
>
> On Sun, Mar 16, 2014 at 10:14 PM, Alejandro Abdelnur <t...@cloudera.com> wrote:
>
>> well, this is for the first block of the file, the rest of the file
>> (blocks being local or not) are streamed out by the same datanode. for
>> small files (one block) you'll get locality, for large files only the first
>> block, and by chance if other blocks are local to that datanode.
>>
>> Alejandro
>> (phone typing)
>>
>> On Mar 16, 2014, at 18:53, Mingjiang Shi <m...@gopivotal.com> wrote:
>>
>> According to this page:
>> http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/
>>
>>> *Data Locality*: The file read and file write calls are redirected to
>>> the corresponding datanodes. It uses the full bandwidth of the Hadoop
>>> cluster for streaming data.
>>>
>>> *A HDFS Built-in Component*: WebHDFS is a first class built-in
>>> component of HDFS. It runs inside Namenodes and Datanodes, therefore, it
>>> can use all HDFS functionalities. It is a part of HDFS - there are no
>>> additional servers to install
>>
>> So it looks like the data locality is built-into webhdfs, client will be
>> redirected to the data node automatically.
>>
>> On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling <rnowl...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I'm writing up a Google Summer of Code proposal to add HDFS support to
>>> Disco, an Erlang MapReduce framework.
>>>
>>> We're interested in using WebHDFS. I have two questions:
>>>
>>> 1) Does WebHDFS allow querying data locality information?
>>>
>>> 2) If the data locality information is known, can data on specific data
>>> nodes be accessed via Web HDFS? Or do all Web HDFS requests have to go
>>> through a single server?
>>>
>>> Thanks,
>>> RJ
>>>
>>> --
>>> em rnowl...@gmail.com
>>> c 954.496.2314
>>
>> --
>> Cheers
>> -MJ
>
> --
> em rnowl...@gmail.com
> c 954.496.2314

--
Alejandro
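
For anyone who wants to try the experiment RJ mentions (requesting the second or third block directly), here is a minimal sketch of how the per-block requests could be built. It relies only on the documented `offset` and `length` parameters of the WebHDFS `OPEN` operation. The helper name `block_open_urls`, the namenode host/port and the example path are hypothetical, and the file length and block size are assumed to be known already (for example from a `GETFILESTATUS` call). Whether the namenode redirects each ranged request to a datanode that actually holds that block is exactly what the experiment would need to check; nothing in this thread confirms it.

```python
from urllib.parse import quote, urlencode


def block_open_urls(namenode, path, file_length, block_size):
    """Return one WebHDFS OPEN URL per HDFS block of a file.

    namenode    -- "host:port" of the namenode HTTP endpoint (hypothetical value)
    path        -- absolute HDFS path of the file
    file_length -- file size in bytes (assumed known, e.g. via GETFILESTATUS)
    block_size  -- HDFS block size in bytes (assumed known, same source)
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")

    base = f"http://{namenode}/webhdfs/v1{quote(path)}"
    urls = []
    # Walk the file one block at a time; the last block may be shorter.
    for offset in range(0, file_length, block_size):
        length = min(block_size, file_length - offset)
        query = urlencode({"op": "OPEN", "offset": offset, "length": length})
        urls.append(f"{base}?{query}")
    return urls
```

Issuing each URL without following redirects and inspecting the `Location` header of the namenode's response would show which datanode is chosen for each byte range, which a scheduler could then compare against the nodes it has available.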