RE: hadoop/hdfs cache question, do client processes share cache?

2015-08-11 Thread Naganarasimha G R (Naga)
Hi Demai,
the centralized cache requires 'explicit' configuration, so by default there is no HDFS-managed cache?
Yes, only explicit centralized caching is supported. The data stored in HDFS is generally far larger than memory, and if multiple clients are accessing a DataNode the cache hit ratio would be very low, so there is no point in an implicit HDFS-managed cache.
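For illustration (not part of the original reply), here is a minimal Java sketch of explicitly pinning a file with the centralized cache API; the pool name and path are made up, and the same can be done from the command line with hdfs cacheadmin -addPool / -addDirective:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
import org.apache.hadoop.hdfs.protocol.CachePoolInfo;

public class CachePinSketch {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath; assumes
    // fs.defaultFS points at HDFS and the caller may create cache pools.
    Configuration conf = new Configuration();
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

    // Create a cache pool (hypothetical name) and pin one file into it.
    dfs.addCachePool(new CachePoolInfo("demo-pool"));
    long id = dfs.addCacheDirective(new CacheDirectiveInfo.Builder()
        .setPath(new Path("/user/demai/bigfile"))   // hypothetical path
        .setPool("demo-pool")
        .setReplication((short) 1)                  // cache on one DataNode only
        .build());
    System.out.println("Added cache directive " + id);
  }
}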

Will caching occur at the local filesystem level, as in Linux?
Please refer to dfs.datanode.drop.cache.behind.reads & 
dfs.datanode.drop.cache.behind.writes in 
http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
They give more details about how the DataNode interacts with the OS buffer cache.

the client has 10 processes repeatedly reading the same HDFS file. Will the HDFS client API be able to cache the file content at the client side?
Each individual process will have its own HDFS client, so this caching needs to be done at the application layer.
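To make that concrete, here is a hypothetical sketch (not an HDFS API) of per-process application-layer caching; sharing across the 10 processes would still need something external, e.g. copying the file to local disk once and letting the OS page cache serve it:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

/** Caches small HDFS files in this process; each process keeps its own copy. */
public class HdfsFileCache {
  private final FileSystem fs;
  private final ConcurrentHashMap<String, byte[]> cache = new ConcurrentHashMap<>();

  public HdfsFileCache(Configuration conf) throws IOException {
    this.fs = FileSystem.get(conf);
  }

  /** Reads the file from HDFS on the first call only; later calls come from memory. */
  public InputStream open(String path) throws IOException {
    byte[] data = cache.computeIfAbsent(path, p -> {
      try (InputStream in = fs.open(new Path(p))) {
        java.io.ByteArrayOutputStream buf = new java.io.ByteArrayOutputStream();
        IOUtils.copyBytes(in, buf, 4096, false);
        return buf.toByteArray();
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    });
    return new ByteArrayInputStream(data);
  }
}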

Or will every READ have to move the whole file over the network, with no sharing between processes?
Yes, every READ will have to move the whole file over the network; there is no sharing 
between multiple clients/processes on a given node.

+ Naga

From: Demai Ni [nid...@gmail.com]
Sent: Wednesday, August 12, 2015 02:05
To: user@hadoop.apache.org
Subject: Re: hadoop/hdfs cache question, do client processes share cache?

Ritesh,

Many thanks for your response. I just read through the Centralized Cache 
document; thanks for the pointer. A couple of follow-up questions.

First, the centralized cache requires 'explicit' configuration, so by default 
there is no HDFS-managed cache? Will caching occur at the local filesystem level, 
as in Linux?

The 2nd question: the centralized cache is among the DataNodes of HDFS. Let's say 
the client is a stand-alone Linux box (not part of the cluster) that connects to the 
HDFS cluster with centralized cache configured, so on the HDFS cluster the file is 
cached. In this scenario, the client has 10 processes repeatedly reading the same 
HDFS file. Will the HDFS client API be able to cache the file content at the client 
side? Or will every READ have to move the whole file over the network, with no 
sharing between processes?

Demai


On Tue, Aug 11, 2015 at 12:58 PM, Ritesh Kumar Singh 
<riteshoneinamill...@gmail.com> wrote:
Let's assume that HDFS maintains 3 replicas of the 256MB block; then each of 
these 3 DataNodes will hold only one copy of the block in its respective memory 
cache, thus avoiding repeated I/O reads. This follows the centralized cache 
management policy of HDFS, which also gives you the option to pin only 2 of these 
3 replicas in cache and save the remaining 256MB of cache space. Here's a link 
on the same.

Hope that helps.

Ritesh



RE: Remote Yarn Client

2015-08-11 Thread Naganarasimha G R (Naga)
Hi Istabrak,
If an installation package similar to the server's is copied to the client, and the 
configurations (/etc/hadoop) of HDFS, YARN & MapReduce are also 
copied to the client machine, then you should be able to use the CLI from the 
remote machine.
It basically uses RPC over protobuf to connect remotely to the server. You can 
also consider the REST APIs if you don't want the whole installation folder 
on the client machine. Another alternative is to copy the specific client jars, 
which is a little cumbersome.
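As a rough sketch of the RPC route (assuming the cluster's yarn-site.xml is on the client's classpath, or the ResourceManager address is set explicitly as below; the hostname here is made up):

import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RemoteYarnClientSketch {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();
    // Only needed if the cluster's yarn-site.xml is not on the classpath.
    conf.set(YarnConfiguration.RM_ADDRESS, "rm-host.example.com:8032");

    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();
    try {
      // Simple remote call: list the applications known to the ResourceManager.
      List<ApplicationReport> apps = yarnClient.getApplications();
      for (ApplicationReport app : apps) {
        System.out.println(app.getApplicationId() + " " + app.getYarnApplicationState());
      }
    } finally {
      yarnClient.stop();
    }
  }
}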

If you describe in more detail what you want to do remotely, we could suggest 
something more precise.

+ Naga

From: Istabrak Abdul-Fatah [ifa...@gmail.com]
Sent: Wednesday, August 12, 2015 03:29
To: user@hadoop.apache.org
Subject: Remote Yarn Client

Greetings to all,

In my project, the YARN client is located on a different node.
Does YARN support remote client connections? And if so, what are the supported 
protocols?
Could you please provide some details?
Code snippets would be appreciated.

Thx & BR

Ista


Remote Yarn Client

2015-08-11 Thread Istabrak Abdul-Fatah
Greetings to all,

In my project, the YARN client is located on a different node.
Does YARN support remote client connections? And if so, what are the
supported protocols?
Could you please provide some details?
Code snippets would be appreciated.

Thx & BR

Ista


Re: hadoop/hdfs cache question, do client processes share cache?

2015-08-11 Thread Demai Ni
Ritesh,

Many thanks for your response. I just read through the Centralized Cache
document; thanks for the pointer. A couple of follow-up questions.

First, the centralized cache requires 'explicit' configuration, so by default
there is no HDFS-managed cache? Will caching occur at the local filesystem
level, as in Linux?

The 2nd question: the centralized cache is among the DataNodes of HDFS. Let's
say the client is a stand-alone Linux box (not part of the cluster) that
connects to the HDFS cluster with centralized cache configured, so on the HDFS
cluster the file is cached. In this scenario, the client has 10 processes
repeatedly reading the same HDFS file. Will the HDFS client API be able to
cache the file content at the client side? Or will every READ have to move the
whole file over the network, with no sharing between processes?

Demai


On Tue, Aug 11, 2015 at 12:58 PM, Ritesh Kumar Singh <
riteshoneinamill...@gmail.com> wrote:

> Let's assume that HDFS maintains 3 replicas of the 256MB block; then each
> of these 3 DataNodes will hold only one copy of the block in its respective
> memory cache, thus avoiding repeated I/O reads. This follows the centralized
> cache management policy of HDFS, which also gives you the option to pin only
> 2 of these 3 replicas in cache and save the remaining 256MB of cache space.
> Here's a link on the same.
>
> Hope that helps.
>
> Ritesh
>


Re: hadoop/hdfs cache question, do client processes share cache?

2015-08-11 Thread Ritesh Kumar Singh
Let's assume that HDFS maintains 3 replicas of the 256MB block; then each of
these 3 DataNodes will hold only one copy of the block in its respective
memory cache, thus avoiding repeated I/O reads. This follows the centralized
cache management policy of HDFS, which also gives you the option to pin only
2 of these 3 replicas in cache and save the remaining 256MB of cache space.
Here's a link on the same.

Hope that helps.

Ritesh


Issues and Questions on running Hadoop 2.7 Yarn MapReduce examples

2015-08-11 Thread Istabrak Abdul-Fatah
Greetings to all,
I have installed and configured Hadoop 2.7.0 on a Linux VM. Then I
successfully ran the pre-compiled/packaged examples (e.g. PI, WordCount,
etc.).
I also downloaded the Hadoop 2.7.0 source code and created an Eclipse
project.
I exported the WordCount jar file and tried to run the example from the
command line as follows:

> yarn jar /opt/yarn/my_examples/WordCount.jar
/user/yarn/input/wordcount.txt output


Q1: When I used the default WordCount implementation (shown in listing1),
it failed with a list of exceptions and a suggestion to use the Tool and
ToolRunner interfaces/utilities (see errorListing1).
  I updated the code (see listing2) to use these suggested utilities and it
ran successfully.
  Could you please explain why the application failed to run on the first
attempt and why the Tool/ToolRunner utilities are necessary?
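(For reference, a driver in the shape the warning asks for might look like the following sketch; the mapper/reducer shown are the standard WordCount ones rather than the listings referenced above, and setJarByClass is what ships the job jar so the tasks can find WordCount$Map.)

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

  public static class Map extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    Job job = Job.getInstance(getConf(), "word count");
    job.setJarByClass(WordCount.class);  // ships the jar so tasks can load WordCount$Map
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses generic options (-D, -files, -libjars, ...) before run().
    System.exit(ToolRunner.run(new Configuration(), new WordCount(), args));
  }
}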


Q2: Does this example create a YARN client implicitly and interact with
the YARN layer? If not, could you please explain how the application
interacted with the HDFS layer, given that the YARN layer is in between?


Thx and BR

Ista
[hdfs@caotclc04881 ~]$ yarn jar /opt/yarn/my_examples/WordCount.jar 
/user/yarn/input/wordcount.txt output
15/08/05 13:59:32 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
15/08/05 13:59:33 INFO client.RMProxy: Connecting to ResourceManager at 
/0.0.0.0:8032
15/08/05 13:59:33 WARN mapreduce.JobResourceUploader: Hadoop command-line 
option parsing not performed. Implement the Tool interface and execute your 
application with ToolRunner to remedy this.
15/08/05 13:59:33 WARN mapreduce.JobResourceUploader: No job jar file set.  
User classes may not be found. See Job or Job#setJar(String).
15/08/05 13:59:33 INFO input.FileInputFormat: Total input paths to process : 1
15/08/05 13:59:33 INFO mapreduce.JobSubmitter: number of splits:1
15/08/05 13:59:34 INFO mapreduce.JobSubmitter: Submitting tokens for job: 
job_1437148602144_0005
15/08/05 13:59:34 INFO mapred.YARNRunner: Job jar is not present. Not adding 
any jar to the list of resources.
15/08/05 13:59:34 INFO impl.YarnClientImpl: Submitted application 
application_1437148602144_0005
15/08/05 13:59:34 INFO mapreduce.Job: The url to track the job: 
http://caotclc04881:8088/proxy/application_1437148602144_0005/
15/08/05 13:59:34 INFO mapreduce.Job: Running job: job_1437148602144_0005
15/08/05 13:59:40 INFO mapreduce.Job: Job job_1437148602144_0005 running in 
uber mode : false
15/08/05 13:59:40 INFO mapreduce.Job:  map 0% reduce 0%
15/08/05 13:59:43 INFO mapreduce.Job: Task Id : 
attempt_1437148602144_0005_m_00_0, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class 
WordCount$Map not found
at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at 
org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:186)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:745)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassNotFoundException: Class WordCount$Map not found
at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
... 8 more

15/08/05 13:59:47 INFO mapreduce.Job: Task Id : 
attempt_1437148602144_0005_m_00_1, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class 
WordCount$Map not found
at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at 
org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:186)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:745)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassNotFoundException: Class WordCount$Map not found
at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
... 8 more

15/08/05 13:59:51 INFO mapreduce.Job: Task Id : 
attempt

hadoop/hdfs cache question, do client processes share cache?

2015-08-11 Thread Demai Ni
Hi folks,

I have a quick question about how HDFS handles caching. In this lab
experiment, I have a 4-node Hadoop cluster (2.x) and each node has fairly
large memory (96GB). I have a single 256MB HDFS file, which also fits in one
HDFS block. The local filesystem is Linux.

Now, from one of the DataNodes, I started 10 Hadoop client processes that
repeatedly read the above file, with the assumption that HDFS will cache the
256MB in memory so that (after the 1st read) READs involve no disk I/O
anymore.

My question is: *how many COPIES of the 256MB will be in the memory of this
DataNode? 10 or 1?*

What if the 10 client processes are located on a 5th Linux box, independent
of the cluster? Will we have 10 copies of the 256MB or just 1?

Many thanks. Appreciate your help on this.

Demai


Heterogeneous

2015-08-11 Thread ????
Is the API of "Heterogeneous Storages in HDFS" OK?