Re: Data Locality and WebHDFS

2014-03-17 Thread Tsz Wo Sze
The file offset is considered in WebHDFS redirection.  It redirects to a 
datanode with the first block the client is going to read, not the first 
block of the file.
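
As a concrete illustration (a hedged sketch, not from the original thread: the 
hostname, port, path, and block size below are placeholders, with 50070 as the 
customary namenode HTTP port), you can issue an OPEN at a byte offset and look 
at the 307 redirect instead of following it; the Location header names the 
datanode chosen for that offset:

  import java.net.HttpURLConnection;
  import java.net.URL;

  public class WebHdfsOffsetProbe {
    public static void main(String[] args) throws Exception {
      // An offset that falls in the second block, assuming 64 MB blocks.
      long offset = 64L * 1024 * 1024;
      URL url = new URL("http://namenode:50070/webhdfs/v1/user/rj/data.bin"
          + "?op=OPEN&offset=" + offset);
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      conn.setInstanceFollowRedirects(false);  // keep the 307 visible
      System.out.println(conn.getResponseCode());
      // Expect the host of a datanode holding the block at this offset.
      System.out.println(conn.getHeaderField("Location"));
      conn.disconnect();
    }
  }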

Hope it helps.
Tsz-Wo



On Monday, March 17, 2014 10:09 AM, Alejandro Abdelnur t...@cloudera.com 
wrote:
 
Actually, I am wrong; the WebHDFS REST call has an offset.

Alejandro
(phone typing)

On Mar 17, 2014, at 10:07, Alejandro Abdelnur t...@cloudera.com wrote:


Don't recall how skips are handled in WebHDFS, but I would assume that you'll 
get to the first block as usual, and the skip is handled by the DN serving the 
file (as WebHDFS does not know at open that you'll skip).

Alejandro
(phone typing)

On Mar 17, 2014, at 9:47, RJ Nowling rnowl...@gmail.com wrote:


Hi Alejandro,


The WebHDFS API allows specifying an offset and length for the request.  If I 
specify an offset that starts in the second block of a file (thus skipping 
the first block altogether), will the namenode still direct me to a 
datanode with the first block, or will it direct me to a datanode with the 
second block?  I.e., am I assured data locality only on the first block of 
the file (as you're saying) or on the first block I am accessing?


If it is as you say, then I may want to reach out to the WebHDFS developers 
and see if they would be interested in adding this functionality.


Thank you,
RJ



On Mon, Mar 17, 2014 at 2:40 AM, Alejandro Abdelnur t...@cloudera.com wrote:

I may have expressed myself poorly. You don't need to run any tests to see how 
locality works with files of multiple blocks. If you are accessing a file of 
more than one block over WebHDFS, you are only assured locality for the 
first block of the file.


Thanks.



On Sun, Mar 16, 2014 at 9:18 PM, RJ Nowling rnowl...@gmail.com wrote:

Thank you, Mingjiang and Alejandro.


This is interesting.  Since we will use the data locality information for 
scheduling, we could hack this to get the data locality information, at 
least for the first block.  As Alejandro says, we'd have to test what 
happens for other data blocks -- e.g., what if, knowing the block sizes, we 
request the second or third block?
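
One possible shape for that experiment, as a sketch only (the host, port, and 
path are illustrative, the block size is assumed to be the cluster's 64 MB 
default, and the file length is assumed known, e.g. from a LISTSTATUS call): 
probe the namenode's redirect at each block's starting offset and record which 
datanode host comes back.

  import java.net.HttpURLConnection;
  import java.net.URL;

  public class BlockHostProbe {
    // Ask the namenode where it would redirect an OPEN at this offset,
    // without following the redirect; return the datanode's hostname.
    static String hostForOffset(String base, long offset) throws Exception {
      URL url = new URL(base + "?op=OPEN&offset=" + offset);
      HttpURLConnection c = (HttpURLConnection) url.openConnection();
      c.setInstanceFollowRedirects(false);
      String location = c.getHeaderField("Location");  // issues the request
      c.disconnect();
      return location == null ? null : new URL(location).getHost();
    }

    public static void main(String[] args) throws Exception {
      String base = "http://namenode:50070/webhdfs/v1/user/rj/data.bin";
      long blockSize = 64L * 1024 * 1024;    // assumed cluster block size
      long fileLength = 200L * 1024 * 1024;  // assumed known file length
      for (long off = 0; off < fileLength; off += blockSize) {
        System.out.println("block at " + off + " -> "
            + hostForOffset(base, off));
      }
    }
  }

If the redirect really does take the offset into account, each probe should 
name a datanode holding that particular block.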


Interesting food for thought!  I see some experiments in my future!  


Thanks!



On Sun, Mar 16, 2014 at 10:14 PM, Alejandro Abdelnur t...@cloudera.com 
wrote:

Well, this is for the first block of the file; the rest of the file (whether 
its blocks are local or not) is streamed out by the same datanode. For small 
files (one block) you'll get locality; for large files, only the first block, 
plus whatever other blocks happen by chance to be local to that datanode.



Alejandro
(phone typing)

On Mar 16, 2014, at 18:53, Mingjiang Shi m...@gopivotal.com wrote:


According to this page: 
http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/

  Data Locality: The file read and file write calls are redirected to the
  corresponding datanodes. It uses the full bandwidth of the Hadoop cluster
  for streaming data.

  A HDFS Built-in Component: WebHDFS is a first class built-in component of
  HDFS. It runs inside Namenodes and Datanodes, therefore, it can use all
  HDFS functionalities. It is a part of HDFS – there are no additional
  servers to install.

So it looks like data locality is built into WebHDFS; the client will be 
redirected to the datanode automatically.

On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling rnowl...@gmail.com wrote:

Hi all,


I'm writing up a Google Summer of Code proposal to add HDFS support to 
Disco, an Erlang MapReduce framework.  


We're interested in using WebHDFS.  I have two questions:


1) Does WebHDFS allow querying data locality information?


2) If the data locality information is known, can data on specific 
datanodes be accessed via WebHDFS?  Or do all WebHDFS requests have to go 
through a single server?

Thanks,
RJ


-- 
em rnowl...@gmail.com
c 954.496.2314 


-- 

Cheers
-MJ




-- 
em rnowl...@gmail.com
c 954.496.2314 



-- 
Alejandro 



-- 
em rnowl...@gmail.com
c 954.496.2314 



Re: Hoop into 0.23 release

2011-08-22 Thread Tsz Wo Sze
+1
I believe HDFS-2178 is very close to being committed.  Great work, Alejandro!

Nicholas




From: Alejandro Abdelnur t...@cloudera.com
To: common-user@hadoop.apache.org; hdfs-...@hadoop.apache.org
Sent: Monday, August 22, 2011 2:16 PM
Subject: Hoop into 0.23 release

Hadoop developers,

Arun will be cutting a branch for Hadoop 0.23 as soon as the trunk has a
successful build.

I'd like Hoop (https://issues.apache.org/jira/browse/HDFS-2178) to be part
of 0.23 (Nicholas already looked at the code).

In addition, the Jersey utils in Hoop will be handy for
https://issues.apache.org/jira/browse/MAPREDUCE-2863.

Most, if not all, of the remaining work is not development but Java package
renaming (to be org.apache.hadoop..), Maven integration (sub-modules), and
final packaging.

The current blocker is deciding on the Maven sub-module organization:
https://issues.apache.org/jira/browse/HADOOP-7560

I'll drive the discussion for HADOOP-7560 and as soon as we have an
agreement there I'll refactor Hoop accordingly.

Does this sound reasonable?

Thanks.

Alejandro

Re: DFSClient Protocol and FileSystem class

2011-07-31 Thread Tsz Wo Sze
Hi JD,

FileSystem is a public API but DFSClient is an internal class.  For developing 
Hadoop applications, we should use FileSystem.
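
A minimal sketch of the recommended route (the namenode URI, port, and path 
below are placeholders, not details from this thread):

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class FileSystemExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // The public API picks the right implementation from the URI scheme,
      // so application code never touches DFSClient directly.
      FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
      FSDataInputStream in = fs.open(new Path("/user/jd/input.txt"));
      try {
        byte[] buf = new byte[4096];
        int n = in.read(buf);
        System.out.println("Read " + n + " bytes");
      } finally {
        in.close();
        fs.close();
      }
    }
  }

DFSClient sits underneath this API and can change between releases, which is 
why application code should stay on the FileSystem side of the line.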

Tsz-Wo




From: jagaran das jagaran_...@yahoo.co.in
To: common-user@hadoop.apache.org common-user@hadoop.apache.org
Sent: Sunday, July 31, 2011 2:50 PM
Subject: DFSClient Protocol and FileSystem class



What is the difference between the DFSClient protocol and the FileSystem class 
in Hadoop DFS (HDFS)?  Both of these classes are used for connecting a remote 
client to the namenode in HDFS, so I wanted to know the advantages of one over 
the other and which one is suitable for remote-client connections.


Regards,
JD

Re: ls command output format

2008-11-24 Thread Tsz Wo Sze
Filed HADOOP-4719 for this.

Nicholas Sze.




- Original Message 
 From: Tsz Wo (Nicholas), Sze [EMAIL PROTECTED]
 To: core-user@hadoop.apache.org
 Sent: Friday, November 21, 2008 7:54:27 AM
 Subject: Re: ls command output format
 
 Hi Alex,
 
 Yes, the doc about ls is out-dated.  Thanks for pointing this out.  Would you 
 mind filing a JIRA?
 
 Nicholas Sze
 
 
 
 - Original Message 
  From: Alexander Aristov 
  To: core-user@hadoop.apache.org
  Sent: Friday, November 21, 2008 6:08:08 AM
  Subject: Re: ls command output format
  
  Found out that the output was changed in 0.18.
  
  see HADOOP-2865 
  
  The docs should then be updated as well.
  
  Alex
  
  2008/11/21 Alexander Aristov 
  
   Hello
  
    I wonder if the hadoop shell command ls has changed its output format.

    Trying hadoop-0.18.2, I got the following output:
  
   [root]# hadoop fs -ls /
   Found 2 items
   drwxr-xr-x   - root supergroup  0 2008-11-21 08:08 /mnt
   drwxr-xr-x   - root supergroup  0 2008-11-21 08:19 /repos
  
  
    Though according to the docs, the file name should go first:
    http://hadoop.apache.org/core/docs/r0.18.2/hdfs_shell.html#ls
  
    Usage: hadoop fs -ls <args>
    For a file returns stat on the file with the following format:
    filename filesize modification_date modification_time
    permissions userid groupid
    For a directory it returns list of its direct children as in unix. A
    directory is listed as:
    dirname <dir> modification_date modification_time permissions userid
    groupid
    Example:
    hadoop fs -ls /user/hadoop/file1 /user/hadoop/file2
    hdfs://nn.example.com/user/hadoop/dir1 /nonexistentfile
    Exit Code:
    Returns 0 on success and -1 on error.
  
  
    I wouldn't have noticed the issue if I didn't have scripts that rely on
    the formatting.
  
   --
   Best Regards
   Alexander Aristov
  
  
  
  
  -- 
  Best Regards
  Alexander Aristov