Re: Warnings?

2013-04-26 Thread Ted Xu
Hi Kevin,

Please see my comments inline,


On Sat, Apr 27, 2013 at 11:24 AM, Kevin Burton wrote:

> Is the native library not available for Ubuntu? If so how do I load it?
>
Native libraries usually require recompilation; for more information please
refer to the Native Libraries documentation.


> Can I tell which key is off? Since I am just starting I would want to be
> as up to date as possible. It is out of date probably because I copied my
> examples from books and tutorials.
>
I think the warning messages already tell you which key: "xxx is deprecated,
use xxx instead...". In fact, most of the configuration keys changed from
hadoop 1.x to 2.x. The compatibility changes may later be documented at
http://wiki.apache.org/hadoop/Compatibility.


> The main class does derive from Tool. Should I ignore this warning as it
> seems to be in error?
>
Of course you can ignore this warning as long as you don't use hadoop
generic options.
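
For illustration, a minimal sketch of the conventional Tool-based driver (the
class name, job name and input/output handling below are placeholders, not your
actual code). Note that the warning can still appear even when the class
implements Tool if the driver builds its JobConf from a fresh Configuration
instead of the one ToolRunner parsed, so the sketch passes getConf() through:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // getConf() already carries whatever -D/-conf/-fs/-jt options
    // ToolRunner stripped from the command line.
    JobConf conf = new JobConf(getConf(), WordCountDriver.class);
    conf.setJobName("wordcount");
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    // set mapper/reducer/output classes here as in your book example
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new WordCountDriver(), args));
  }
}

Run as e.g. "hadoop jar wordcount.jar WordCountDriver -D mapred.reduce.tasks=2
input output"; the -D is handled by the generic options parsing the warning is
talking about.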


>
> Thank you.
>
> On Apr 26, 2013, at 7:49 PM, Ted Xu  wrote:
>
> Hi,
>
> First warning is saying hadoop cannot load native library, usually a
> compression codec. In that case, hadoop will use java implementation
> instead, which is slower.
>
> Second is caused by hadoop 1.x/2.x configuration key change. You're using
> a 1.x style key under 2.x, yet hadoop still guarantees backward
> compatibility.
>
> Third is saying that the main class of a hadoop application is recommended
> to implement org.apache.hadoop.util.Tool,
> or else generic command line options (e.g., -D options) will not be supported.
>
>
>
> On Sat, Apr 27, 2013 at 5:51 AM,  wrote:
>
>> I am running a simple WordCount m/r job and I get output but I get five
>> warnings that I am not sure if I should pay attention to:
>>
>> *13/04/26 16:24:50 WARN util.NativeCodeLoader: Unable to load
>> native-hadoop library for your platform... using builtin-java classes where
>> applicable *
>>
>> *13/04/26 16:24:50 WARN conf.Configuration: session.id is deprecated.
>> Instead, use dfs.metrics.session-id *
>>
>> *13/04/26 16:24:50 WARN mapred.JobClient: Use GenericOptionsParser for
>> parsing the arguments. Applications should implement Tool for the same. *
>>
>> *13/04/26 16:24:51 WARN mapreduce.Counters: Group
>> org.apache.hadoop.mapred.Task$Counter is deprecated. Use
>> org.apache.hadoop.mapreduce.TaskCounter instead *
>>
>> *13/04/26 16:24:51 WARN mapreduce.Counters: Counter name MAP_INPUT_BYTES
>> is deprecated. Use FileInputFormatCounters as group name and  BYTES_READ as
>> counter name instead*
>>
>> Any ideas on what these mean? The only one that I can see in the code is
>> the third one. I am using GenericOptionsParser as it is part of an example
>> that I copied. But I don't know why this is considered bad.
>>
>> Thank you.
>>
>>
>
>
> --
> Regards,
> Ted Xu
>
>


-- 
Regards,
Ted Xu


Re: Warnings?

2013-04-26 Thread Kevin Burton
Is the native library not available for Ubuntu? If so how do I load it?

Can I tell which key is off? Since I am just starting I would want to be as up 
to date as possible. It is out of date probably because I copied my examples 
from books and tutorials.

The main class does derive from Tool. Should I ignore this warning as it seems 
to be in error?

Thank you.

On Apr 26, 2013, at 7:49 PM, Ted Xu  wrote:

> Hi,
> 
> First warning is saying hadoop cannot load native library, usually a 
> compression codec. In that case, hadoop will use java implementation instead, 
> which is slower.
> 
> Second is caused by hadoop 1.x/2.x configuration key change. You're using a 
> 1.x style key under 2.x, yet hadoop still guarantees backward compatibility.
> 
> Third is saying that the main class of a hadoop application is recommended to 
> implement org.apache.hadoop.util.Tool, or else generic command line options 
> (e.g., -D options) will not be supported.
> 
> 
> On Sat, Apr 27, 2013 at 5:51 AM,  wrote:
>> I am running a simple WordCount m/r job and I get output but I get five 
>> warnings that I am not sure if I should pay attention to:
>> 
>> 13/04/26 16:24:50 WARN util.NativeCodeLoader: Unable to load native-hadoop 
>> library for your platform... using builtin-java classes where applicable
>> 
>> 13/04/26 16:24:50 WARN conf.Configuration: session.id is deprecated. 
>> Instead, use dfs.metrics.session-id
>> 
>> 13/04/26 16:24:50 WARN mapred.JobClient: Use GenericOptionsParser for 
>> parsing the arguments. Applications should implement Tool for the same.
>> 
>> 13/04/26 16:24:51 WARN mapreduce.Counters: Group 
>> org.apache.hadoop.mapred.Task$Counter is deprecated. Use 
>> org.apache.hadoop.mapreduce.TaskCounter instead
>> 
>> 13/04/26 16:24:51 WARN mapreduce.Counters: Counter name MAP_INPUT_BYTES is 
>> deprecated. Use FileInputFormatCounters as group name and  BYTES_READ as 
>> counter name instead
>> 
>> Any ideas on what these mean? The only one that I can see in the code is the 
>> third one. I am using GenericOptionsParser as it is part of an example that 
>> I copied. But I don't know why this is considered bad.
>> 
>> Thank you.
> 
> 
> 
> -- 
> Regards,
> Ted Xu


Re: Warnings?

2013-04-26 Thread Ted Xu
Hi,

First warning is saying hadoop cannot load native library, usually a
compression codec. In that case, hadoop will use java implementation
instead, which is slower.

Second is caused by hadoop 1.x/2.x configuration key change. You're using a
1.x style key under 2.x, yet hadoop still guarantees backward compatibility.

Third is saying that the main class of a hadoop application is recommended
to implement org.apache.hadoop.util.Tool,
or else generic command line options (e.g., -D options) will not be supported.



On Sat, Apr 27, 2013 at 5:51 AM,  wrote:

> I am running a simple WordCount m/r job and I get output but I get five
> warnings that I am not sure if I should pay attention to:
>
> *13/04/26 16:24:50 WARN util.NativeCodeLoader: Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicable *
>
> *13/04/26 16:24:50 WARN conf.Configuration: session.id is deprecated.
> Instead, use dfs.metrics.session-id *
>
> *13/04/26 16:24:50 WARN mapred.JobClient: Use GenericOptionsParser for
> parsing the arguments. Applications should implement Tool for the same. *
>
> *13/04/26 16:24:51 WARN mapreduce.Counters: Group
> org.apache.hadoop.mapred.Task$Counter is deprecated. Use
> org.apache.hadoop.mapreduce.TaskCounter instead *
>
> *13/04/26 16:24:51 WARN mapreduce.Counters: Counter name MAP_INPUT_BYTES
> is deprecated. Use FileInputFormatCounters as group name and  BYTES_READ as
> counter name instead*
>
> Any ideas on what these mean? The only one that I can see in the code is
> the third one. I am using GenericOptionsParser as it is part of an example
> that I copied. But I don't know why this is considered bad.
>
> Thank you.
>
>


-- 
Regards,
Ted Xu


RE: M/R Staticstics

2013-04-26 Thread Kevin Burton
Answers below.

 

From: Omkar Joshi [mailto:ojo...@hortonworks.com] 
Sent: Friday, April 26, 2013 7:15 PM
To: user@hadoop.apache.org
Subject: Re: M/R Staticstics

 

Have you enabled security?

No

 

can you share the output for your hdfs?

 

bin/hadoop fs -ls /

 

kevin@devUbuntu05:~$ hadoop fs -ls /

Found 2 items

drwxrwxrwx   - hdfs supergroup  0 2013-04-26 13:33 /tmp

drwxr-xr-x   - hdfs supergroup  0 2013-04-19 16:40 /user

 

and is /tmp/hadoop-yarn/staging/history/done directory present in hdfs ? if
so then what permissions?

 

kevin@devUbuntu05:~$ hadoop fs -ls -R /tmp

drwxrwx---   - mapred supergroup  0 2013-04-26 13:33
/tmp/hadoop-yarn

ls: Permission denied: user=kevin, access=READ_EXECUTE,
inode="/tmp/hadoop-yarn":mapred:supergroup:drwxrwx---

 

 

kevin@devUbuntu05:~$ sudo -u hdfs hadoop fs -ls -R /tmp

[sudo] password for kevin:

drwxrwx---   - mapred supergroup  0 2013-04-26 13:33
/tmp/hadoop-yarn

drwxrwx---   - mapred supergroup  0 2013-04-26 13:33
/tmp/hadoop-yarn/staging

drwxrwx---   - mapred supergroup  0 2013-04-26 13:33
/tmp/hadoop-yarn/staging/history

drwxrwx---   - mapred supergroup  0 2013-04-26 13:33
/tmp/hadoop-yarn/staging/history/done

drwxrwxrwt   - mapred supergroup  0 2013-04-26 13:33
/tmp/hadoop-yarn/staging/history/done_intermediate

kevin@devUbuntu05:~$

 

also please share exception stack trace...

 

There is no exception now that I created /tmp on HDFS. But I still cannot
see the logs via port 50030 on the master. In other words nothing seems to
be listening on http://devubuntu05:50030. The log for map reduce looks like:

 

2013-04-26 13:35:26,107 INFO
org.apache.hadoop.mapreduce.v2.hs.HistoryClientService: Instantiated
MRClientService at devUbuntu05/172.16.26.68:10020

2013-04-26 13:35:26,107 INFO org.apache.hadoop.yarn.service.AbstractService:
Service:HistoryClientService is started.

2013-04-26 13:35:26,107 INFO org.apache.hadoop.yarn.service.AbstractService:
Service:org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer is started.

2013-04-26 13:35:55,290 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory:
History Cleaner started

2013-04-26 13:35:55,295 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory:
History Cleaner complete

2013-04-26 13:38:25,283 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory:
Starting scan to move intermediate done files

2013-04-26 13:41:25,283 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory:
Starting scan to move intermediate done files

2013-04-26 13:44:25,283 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory:
Starting scan to move intermediate done files

2013-04-26 13:47:25,283 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory:
Starting scan to move intermediate done files

2013-04-26 13:50:25,283 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory:
Starting scan to move intermediate done files

 

 

Thanks,

Omkar Joshi

Hortonworks Inc

 

On Fri, Apr 26, 2013 at 3:05 PM,  wrote:

  

I was able to overcome the permission exception in the log by creating an
HDFS tmp folder (hadoop fs -mkdir /tmp) and opening it up to the world
(hadoop fs -chmod a+rwx /tmp). That got rid of the exception but I still am
not able to connect to port 50030 to see M/R status. More ideas?

 

Even though the exception was missing from the logs of one server in the
cluster, I looked on another server and found essentially the same
permission problem:

 

2013-04-26 13:34:56,462 FATAL
org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer: Error starting
JobHistoryServer

org.apache.hadoop.yarn.YarnException: Error creating done directory:
[hdfs://devubuntu05:9000/tmp/hadoop-yarn/staging/history/done]

at
org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.init(HistoryFileManager
.java:424)

at
org.apache.hadoop.mapreduce.v2.hs.JobHistory.init(JobHistory.java:87)

at
org.apache.hadoop.yarn.service.CompositeService.init(CompositeService.java:5
8)

 

. . . . .

 

On Fri, Apr 26, 2013 at 10:37 AM, Rishi Yadav wrote: 

 

  do you see "retired jobs" on job tracker page. There is also "job tracker
history" on the bottom of page.  

 

something like this  http://nn.zettabyte.com:50030/jobtracker.jsp

Thanks and Regards, 

Rishi Yadav 







On Fri, Apr 26, 2013 at 7:36 AM, < rkevinbur...@charter.net> wrote: 

When I submit a simple "Hello World" M/R job like WordCount it takes less
than 5 seconds. The texts show numerous methods for monitoring M/R jobs as
they are happening but I have yet to see any that show statistics about a
job after it has completed. Obviously simple jobs that take a short amount
of time don't allow time to fire up any web page or monitoring tool to see
how it progresses through the JobTracker and TaskTracker as well as which
node it is processed on. Any suggestions on how I could see this kind of data
*after* a job has completed? 

 

 



Re: Jobtracker memory issues due to FileSystem$Cache

2013-04-26 Thread agile.j...@gmail.com
We have hit the same problem; I haven't found the reason yet, I'm still debugging it.


On Wed, Apr 17, 2013 at 11:14 PM, Marcin Mejran  wrote:

>  In case anyone is wondering, I tracked this down to a race condition in
> JobInProgress or failure to clean up FileSystems in CleanupQueue (depending
> on how you look at it). 
>
>
> FileSystem.closeAllForUGI is what keeps the cache from memory leaking
> however it’s not called in one thread. However JobInProgress calls
> closeAllForUGI  on a UGI that was also passed to the CleanupQueue thread.
> If closeAllForUGI is called by JobInProgress before CleanupQueue calls
> FileSystem.get with that ugi then there’s a leak. Since CleanupQueue
> doesn’t call closeAllForUGI the filesystem is left cached perpetually.
>
>
> Setting, for example, keep.failed.task.files=true or
> keep.task.files.pattern= prevents CleanupQueue from getting
> called which seems to solve my issues. You get junk left in .staging but
> that can be dealt with.
>
>
> -Marcin
>
>
> *From:* Marcin Mejran [mailto:marcin.mej...@hooklogic.com]
> *Sent:* Tuesday, April 16, 2013 1:47 PM
> *To:* user@hadoop.apache.org
> *Subject:* Jobtracker memory issues due to FileSystem$Cache
>
>
> We’ve recently run into jobtracker memory issues on our new hadoop
> cluster. A heap dump shows that there are thousands of copies of
> DistributedFileSystem kept in FileSystem$Cache, a bit over one for each job
> run on the cluster and their jobconf objects support this view. I believe
> these are created when the .staging directories get cleaned up but I may be
> wrong on that.
>
>
> From what I can tell in the dump, the username (probably not ugi, hard to
> tell), scheme and authority parts of the Cache$Key are the same across
> multiple objects in FileSystem$Cache. I can only assume that the
> usergroupinformation piece differs somehow every time it’s created.
>
>
> We’re using CDH4.2, MR1, CentOS 6.3 and Java 1.6_31. Kerberos, ldap and so
> on are not enabled. 
>
>
> Is there any known reason for this type of behavior?
>
>
> Thanks,
>
> -Marcin
>
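
For illustration, the FileSystem caching behaviour Marcin describes boils down
to roughly the following (a sketch only, using the local filesystem so it runs
without a cluster; hdfs:// instances go through the same cache keyed on scheme,
authority and UGI):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

public class FsCacheSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // FileSystem.get() caches instances; two calls under the same UGI
    // return the same object.
    FileSystem a = FileSystem.get(URI.create("file:///"), conf);
    FileSystem b = FileSystem.get(URI.create("file:///"), conf);
    System.out.println(a == b);  // true: same cached instance
    // Entries only leave the cache when they are closed, for example via
    // FileSystem.closeAllForUGI(ugi). If a FileSystem is created for a UGI
    // after closeAllForUGI has already run for that UGI, nothing ever
    // removes it again -- which is the leak described above.
    UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
    FileSystem.closeAllForUGI(ugi);
  }
}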



-- 
d0ngd0ng


Re: M/R Staticstics

2013-04-26 Thread Omkar Joshi
Have you enabled security?

can you share the output for your hdfs?

bin/hadoop fs -ls /

and is /tmp/hadoop-yarn/staging/history/done directory present in hdfs ? if
so then what permissions?

also please share exception stack trace...

Thanks,
Omkar Joshi
Hortonworks Inc


On Fri, Apr 26, 2013 at 3:05 PM,  wrote:

>
> I was able to overcome the permission exception in the log by creating an
> HDFS tmp folder (hadoop fs -mkdir /tmp) and opening it up to the world
> (hadoop fs -chmod a+rwx /tmp). That got rid of the exception but I still am
> not able to connect to port 50030 to see M/R status. More ideas?
>
> Even though the exception was missing from the logs of one server in the
> cluster, I looked on another server and found essentially the same
> permission problem:
>
> 2013-04-26 13:34:56,462 FATAL
> org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer: Error starting
> JobHistoryServer
> org.apache.hadoop.yarn.YarnException: Error creating done directory:
> [hdfs://devubuntu05:9000/tmp/hadoop-yarn/staging/history/done]
> at
> org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.init(HistoryFileManager.java:424)
> at
> org.apache.hadoop.mapreduce.v2.hs.JobHistory.init(JobHistory.java:87)
> at
> org.apache.hadoop.yarn.service.CompositeService.init(CompositeService.java:58)
>
> . . . . .
>
> On Fri, Apr 26, 2013 at 10:37 AM, Rishi Yadav wrote:
>
>   do you see "retired jobs" on job tracker page. There is also "job
> tracker history" on the bottom of page.
>
> something like this  
> *http://nn.zettabyte.com:50030/jobtracker.jsp*
> Thanks and Regards,
> Rishi Yadav
>
>
>
>
>
> On Fri, Apr 26, 2013 at 7:36 AM, < *rkevinbur...@charter.net*> wrote:
> When I submit a simple "Hello World" M/R job like WordCount it takes less
> than 5 seconds. The texts show numerous methods for monitoring M/R jobs as
> they are happening but I have yet to see any that show statistics about a
> job after it has completed. Obviously simple jobs that take a short amount
> of time don't allow time to fire up any web page or monitoring tool to see
> how it progresses through the JobTracker and TaskTracker as well as which
> node it is processed on. Any suggestions on how I could see this kind of data
> *after* a job has completed?
>
>


Re: M/R Staticstics

2013-04-26 Thread rkevinburton



I was able to overcome the permission exception in the log by creating 
an HDFS tmp folder (hadoop fs -mkdir /tmp) and opening it up to the 
world (hadoop fs -chmod a+rwx /tmp). That got rid of the exception but I 
still am not able to connect to port 50030 to see M/R status. More ideas?


Even though the exception was missing from the logs of one server in the 
cluster, I looked on another server and found essentially the same 
permission problem:


2013-04-26 13:34:56,462 FATAL 
org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer: Error starting 
JobHistoryServer
org.apache.hadoop.yarn.YarnException: Error creating done directory: 
[hdfs://devubuntu05:9000/tmp/hadoop-yarn/staging/history/done]
at 
org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.init(HistoryFileManager.java:424)
at 
org.apache.hadoop.mapreduce.v2.hs.JobHistory.init(JobHistory.java:87)
at 
org.apache.hadoop.yarn.service.CompositeService.init(CompositeService.java:58)


. . . . .

On Fri, Apr 26, 2013 at 10:37 AM, Rishi Yadav wrote:

  do you see "retired jobs" on job tracker page. There is also "job 
tracker history" on the bottom of page. 


something like this  http://nn.zettabyte.com:50030/jobtracker.jsp 


Thanks and Regards,
Rishi Yadav



On Fri, Apr 26, 2013 at 7:36 AM, < rkevinbur...@charter.net 
 

wrote:
When I submit a simple "Hello World" M/R job like WordCount it takes 
less than 5 seconds. The texts show numerous methods for monitoring M/R 
jobs as they are happening but I have yet to see any that show 
statistics about a job after it has completed. Obviously simple jobs 
that take a short amount of time don't allow time to fire up any web 
page or monitoring tool to see how it progresses through the JobTracker 
and TaskTracker as well as which node it is processed on. Any 
suggestions on how I could see this kind of data *after* a job has 
completed?


Warnings?

2013-04-26 Thread rkevinburton


I am running a simple WordCount m/r job and I get output but I get five 
warnings that I am not sure if I should pay attention to:


13/04/26 16:24:50 WARN util.NativeCodeLoader: Unable to load 
native-hadoop library for your platform... using builtin-java classes 
where applicable


13/04/26 16:24:50 WARN conf.Configuration: session.id is deprecated. 
Instead, use dfs.metrics.session-id


13/04/26 16:24:50 WARN mapred.JobClient: Use GenericOptionsParser for 
parsing the arguments. Applications should implement Tool for the same.


13/04/26 16:24:51 WARN mapreduce.Counters: Group 
org.apache.hadoop.mapred.Task$Counter is deprecated. Use 
org.apache.hadoop.mapreduce.TaskCounter instead


13/04/26 16:24:51 WARN mapreduce.Counters: Counter name MAP_INPUT_BYTES 
is deprecated. Use FileInputFormatCounters as group name and  BYTES_READ 
as counter name instead


Any ideas on what these mean? The only one that I can see in the code is 
the third one. I am using GenericOptionsParser as it is part of an 
example that I copied. But I don't know why this is considered bad.


Thank you.


Warnings?

2013-04-26 Thread rkevinburton


I am running a simple WordCount m/r job and I get output but I get four 
warnings that I am not sure if I should pay attention to:


13/04/26 16:24:50 WARN util.NativeCodeLoader: Unable to load 
native-hadoop library for your platform... using builtin-java classes 
where applicable


13/04/26 16:24:50 WARN conf.Configuration: session.id is deprecated. 
Instead, use dfs.metrics.session-id


13/04/26 16:24:50 WARN mapred.JobClient: Use GenericOptionsParser for 
parsing the arguments. Applications should implement Tool for the same.


13/04/26 16:24:51 WARN mapreduce.Counters: Group 
org.apache.hadoop.mapred.Task$Counter is deprecated. Use 
org.apache.hadoop.mapreduce.TaskCounter instead


Any ideas on what these mean? The only one that I can see in the code is 
the last one. I am using GenericOptionsParser as it is part of an 
example that I copied. But I don't know why this is considered bad.


Thank you.


Warnings?

2013-04-26 Thread rkevinburton


I am running a simple WordCount m/r job and I get output but I get three 
warnings that I am not sure if I should pay attention to:


13/04/26 16:24:50 WARN util.NativeCodeLoader: Unable to load 
native-hadoop library for your platform... using builtin-java classes 
where applicable


13/04/26 16:24:50 WARN conf.Configuration: session.id is deprecated. 
Instead, use dfs.metrics.session-id


13/04/26 16:24:50 WARN mapred.JobClient: Use GenericOptionsParser for 
parsing the arguments. Applications should implement Tool for the same.


Any ideas on what these mean? The only one that I can see in the code is 
the last one. I am using GenericOptionsParser as it is part of an 
example that I copied. But I don't know why this is considered bad.


Thank you.


Re: M/R job to a cluster?

2013-04-26 Thread Kevin Burton
It is hdfs://devubuntu05:9000. Is this wrong? Devubuntu05 is the name of the 
host where the NameNode and JobTracker should be running. It is also the host 
where I am running the M/R client code.

On Apr 26, 2013, at 4:06 PM, Rishi Yadav  wrote:

> Check core-site.xml and see the value of fs.default.name. If it has localhost you 
> are running locally.
> 
> 
> 
> 
> On Fri, Apr 26, 2013 at 1:59 PM,  wrote:
>> I suspect that my MapReduce job is being run locally. I don't have any 
>> evidence but I am not sure how the specifics of my configuration are 
>> communicated to the Java code that I write. Based on the text that I have 
>> read online basically I start with code like:
>> 
>> JobClient client = new JobClient();
>> JobConf conf = new JobConf(WordCount.class);
>> . . . . .
>> 
>> Where do I communicate the configuration information so that the M/R job 
>> runs on the cluster and not locally? Or is the configuration location 
>> "magically determined"?
>> 
>> Thank you.
> 


Re: M/R job to a cluster?

2013-04-26 Thread Rishi Yadav
Check core-site.xml and see the value of fs.default.name. If it has localhost,
you are running locally.
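
For illustration, a minimal sketch of the programmatic equivalent, in case the
client's classpath does not pick up the XML files you expect. The NameNode URI
is the one mentioned elsewhere in these threads; the JobTracker port is an
assumption, so check your own mapred-site.xml (and on a YARN/MRv2 setup the
relevant keys are different, e.g. mapreduce.framework.name):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitToCluster {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);  // WordCount is the job class from the question
    // What core-site.xml / mapred-site.xml would normally supply on the client:
    conf.set("fs.default.name", "hdfs://devubuntu05:9000");  // NameNode
    conf.set("mapred.job.tracker", "devubuntu05:8021");      // JobTracker; "local" means LocalJobRunner
    // ... set mapper, reducer, input/output paths as usual ...
    JobClient.runJob(conf);
  }
}

Normally, though, you would keep these values in the XML files and just make
sure the directory containing them is on the client's classpath.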




On Fri, Apr 26, 2013 at 1:59 PM,  wrote:

> I suspect that my MapReduce job is being run locally. I don't have any
> evidence but I am not sure how the specifics of my configuration are
> communicated to the Java code that I write. Based on the text that I have
> read online basically I start with code like:
>
> JobClient client = new JobClient();
> JobConf conf = new JobConf(WordCount.class);
> . . . . .
>
> Where do I communicate the configuration information so that the M/R job
> runs on the cluster and not locally? Or is the configuration location
> "magically determined"?
>
> Thank you.
>


M/R job to a cluster?

2013-04-26 Thread rkevinburton


I suspect that my MapReduce job is being run locally. I don't have any 
evidence but I am not sure how the specifics of my configuration are 
communicated to the Java code that I write. Based on the text that I 
have read online basically I start with code like:


JobClient client = new JobClient();
JobConf conf = new JobConf(WordCount.class);
. . . . .

Where do I communicate the configuration information so that the M/R job 
runs on the cluster and not locally? Or is the configuration location 
"magically determined"?


Thank you.


Exception during reduce phase when running jobs remotely

2013-04-26 Thread Oren Bumgarner
I have a small hadoop cluster running 1.0.4 and I'm trying to have it set up
so that I can run jobs remotely from a computer on the same network that is
not a part of the cluster. I've got a main java class that
implements org.apache.hadoop.util.Tool and I'm able to run this job from
the NameNode using ToolRunner.run(), setting up the JobConf, and submitting
with JobClient.submitJob().

When I try to run the same class remotely from any machine that is not the
NameNode the job is submitted and it appears that the Map tasks
successfully complete, but I get the following exception for all of the
reduce tasks:

org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
output/map_0.out in any of the configured local directories
at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:429)
at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:160)
at 
org.apache.hadoop.mapred.MapOutputFile.getInputFile(MapOutputFile.java:161)
at org.apache.hadoop.mapred.ReduceTask.getMapFiles(ReduceTask.java:220)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:398)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)


I'm not sure how to interpret that error message. I think I'm missing
some config files that are not present on the remote machine but I
can't pin down exactly what I need. Does anyone have any guidance on
what the above error means or how to submit jobs remotely?

Thanks,

--Oren


Re: Automatically mapping a job submitted by a particular user to a specific hadoop map-reduce queue

2013-04-26 Thread Sandy Ryza
Sagar,

I'm glad to hear that it would help.  Unfortunately, we are no longer
adding features to CDH3, so you would have to upgrade to CDH4 or backport
it yourself to use it.

-Sandy


On Fri, Apr 26, 2013 at 10:27 AM, Sagar Mehta  wrote:

> Hi Sandy,
>
> Thanks for your prompt reply!!
>
> The jira that you pointed out would make it easy for us to do the
> automatic mapping and get closer to enforcing a policy
> automatically. Any idea when it would be incorporated into cdh/hadoop
> releases and if it could be back-ported for cdh3u2 which we have currently
> running in production?
>
> Currently we are getting around this using the -Dmapred.job.queue.name="X"
> and the subsequent mapping of map-red job queue to Fair-share scheduler
> pool. We are using ACLs [more of a white-list] by
> configuring  mapred-queue-acls.xml to ensure people can only submit to the
> right queue.
>
> *Two limitations of this round-about approach are*
>
>1. It is manual
>2. It exposes the policy where user A is asked to submit jobs to queue
>X and user B is asked to submit jobs to queue Y [with different scheduler
>properties]. We want this to be completely transparent to the user of our
>cluster.
>
> The jira above would be a great first step towards such automatic mapping!!
>
> Cheers,
> Sagar
>
>
> On Wed, Apr 24, 2013 at 11:41 PM, Sandy Ryza wrote:
>
>> Hi Sagar,
>>
>> This capability currently does not exist in the fair scheduler (or other
>> schedulers, as far as I know), but a JIRA has been filed recently that
>> addresses a similar need.   Would
>> https://issues.apache.org/jira/browse/MAPREDUCE-5132 work for what
>> you're trying to do?  If not, would you mind filing a new JIRA for the
>> functionality you'd want?
>>
>> -Sandy
>>
>>
>> On Wed, Apr 24, 2013 at 6:22 PM, Sagar Mehta wrote:
>>
>>> Hi Guys,
>>>
>>> We have a general purpose Hive cluster [about 200 nodes] which is used
>>> for various jobs like
>>>
>>>- Production
>>>- Experimental/Research
>>>- Adhoc queries
>>>
>>> We are using the fair-share scheduler to schedule them and for this we
>>> have corresponding 3 pools in the scheduler.
>>>
>>> *Here is what we want.*
>>>
>>> *A hive query submitted by a user with user-name A should go to one of
>>> the pools above based on a pre-defined mapping. We are wondering where/how
>>> to specify this mapping?*
>>>
>>> *We can do this manually by adding -Dmapred.job.queue.name="X" on a
>>> particular job run.*
>>>
>>> This puts the job on the map-reduce queue named "X" and the following
>>> configuration in the fair-share scheduler
>>>
>>>   <property>
>>>     <name>mapred.fairscheduler.poolnameproperty</name>
>>>     <value>mapred.job.queue.name</value>
>>>   </property>
>>>
>>> maps this to a pool named "X" in the fair-share scheduler.
>>>
>>> However we [while wearing our Hadoop developer/admin hat] don't want the
>>> user/analyst to specify that so as to enforce some cluster-use policy.
>>>
>>> Based on his/her username we want to automatically select which hadoop
>>> queue and subsequently which fair-share scheduler pool, his/her job should
>>> go to. I'm pretty sure this is a common use-case and wondering how to do
>>> this in Hadoop.
>>>
>>> Any help/insights/pointers would be greatly appreciated.
>>>
>>> Sagar
>>> PS - Btw we are using Cloudera cdh3u2 and the user jobs are Hive queries.
>>>
>>>
>>>
>>>
>>
>


[no subject]

2013-04-26 Thread Mohsen B.Sarmadi
Hi,
I am a newbie in Hadoop.
I am running Hadoop on Mac OS X 10 and I can't load any files into HDFS.

First of all, I am getting this error:

localhost: 2013-04-26 19:08:31.330 java[14436:1b03] Unable to load realm
info from SCDynamicStore

which, from some posts, I understand I should fix by adding this line to
hadoop-env.sh, but it didn't fix it:

export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"

Second, I can't load any files into HDFS. I am trying to run Hadoop in
pseudo-distributed mode, so I used the configuration from here for it.
I am sure Hadoop is loading my configuration because I have successfully
added a Java home to hadoop-env.sh.
This is the error I get:

m0h3n:hadoop-1.0.4 mohsen$ ./bin/hadoop dfs -put conf input
2013-04-26 19:18:04.185 java[14559:1703] Unable to load realm info from
SCDynamicStore
13/04/26 19:18:04 WARN hdfs.DFSClient: DataStreamer Exception:
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
/user/mohsen/input/capacity-scheduler.xml could only be replicated to 0
nodes, instead of 1
 at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1558)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:696)
 at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
 at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
 at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)

at org.apache.hadoop.ipc.Client.call(Client.java:1070)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
 at com.sun.proxy.$Proxy1.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
 at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at com.sun.proxy.$Proxy1.addBlock(Unknown Source)
 at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3510)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3373)
 at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2600(DFSClient.java:2589)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2829)

13/04/26 19:18:04 WARN hdfs.DFSClient: Error Recovery for block null bad
datanode[0] nodes == null
13/04/26 19:18:04 WARN hdfs.DFSClient: Could not get block locations.
Source file "/user/mohsen/input/capacity-scheduler.xml" - Aborting...
put: java.io.IOException: File /user/mohsen/input/capacity-scheduler.xml
could only be replicated to 0 nodes, instead of 1
13/04/26 19:18:04 ERROR hdfs.DFSClient: Exception closing file
/user/mohsen/input/capacity-scheduler.xml :
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
/user/mohsen/input/capacity-scheduler.xml could only be replicated to 0
nodes, instead of 1
 at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1558)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:696)
 at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
 at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
 at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)

org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
/user/mohsen/input/capacity-scheduler.xml could only be replicated to 0
nodes, instead of 1
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1558)
 at
org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:696)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.la

Re: M/R job optimization

2013-04-26 Thread Ted Dunning
Have you checked the logs?

Is there a task that is taking a long time?  What is that task doing?

There are two basic possibilities:

a) you have a skewed join like the other Ted mentioned.  In this case, the
straggler will be seen to be working on data.

b) you have a hung process.  This can be more difficult to diagnose, but
indicates that there is a problem with your cluster.



On Fri, Apr 26, 2013 at 2:21 AM, Han JU  wrote:

> Hi,
>
> I've implemented an algorithm with Hadoop, it's a series of 4 jobs. My
> question is that in one of the jobs, map and reduce tasks show 100% finished
> in about 1m 30s, but I have to wait another 5m for this job to finish.
> This job writes about 720mb compressed data to HDFS with replication
> factor 1, in sequence file format. I've tried copying these data to hdfs,
> it takes only < 20 seconds. What happened during this 5 more minutes?
>
> Any idea on how to optimize this part?
>
> Thanks.
>
> --
> *JU Han*
>
> UTC   -  Université de Technologie de Compiègne
> * **GI06 - Fouille de Données et Décisionnel*
>
> +33 061960
>


Re: M/R Staticstics

2013-04-26 Thread rkevinburton


It seems to be related to some permission problem but I am not sure how 
to overcome it:



2013-04-26 12:35:08,235 INFO 
org.apache.hadoop.mapreduce.v2.hs.JobHistory: JobHistory Init
2013-04-26 12:35:08,886 FATAL 
org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer: Error starting 
JobHistoryServer
org.apache.hadoop.yarn.YarnException: Error creating done directory: 
[hdfs://devubuntu05:9000/tmp/hadoop-yarn/staging/history/done]
at 
org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.init(HistoryFileManager.java:424)
at 
org.apache.hadoop.mapreduce.v2.hs.JobHistory.init(JobHistory.java:87)
at 
org.apache.hadoop.yarn.service.CompositeService.init(CompositeService.java:58)
at 
org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer.init(JobHistoryServer.java:87)
at 
org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer.main(JobHistoryServer.java:145)

. . . . .


On Fri, Apr 26, 2013 at 10:37 AM, Rishi Yadav wrote:

 do you see "retired jobs" on job tracker page. There is also "job 
tracker history" on the bottom of page. 


something like this  http://nn.zettabyte.com:50030/jobtracker.jsp 


Thanks and Regards,
Rishi Yadav



On Fri, Apr 26, 2013 at 7:36 AM, < rkevinbur...@charter.net 
 

wrote:
When I submit a simple "Hello World" M/R job like WordCount it takes 
less than 5 seconds. The texts show numerous methods for monitoring M/R 
jobs as they are happening but I have yet to see any that show 
statistics about a job after it has completed. Obviously simple jobs 
that take a short amount of time don't allow time to fire up any web 
page or monitoring tool to see how it progresses through the JobTracker 
and TaskTracker as well as which node it is processed on. Any 
suggestions on how I could see this kind of data *after* a job has 
completed?


Re: Automatically mapping a job submitted by a particular user to a specific hadoop map-reduce queue

2013-04-26 Thread Sagar Mehta
Hi Vinod,

Yes this is exactly what we are doing right now which works but is manual
and exposes the policy.
I think the JIRA that Sandy pointed out -
https://issues.apache.org/jira/browse/MAPREDUCE-5132 is a good first step
in that direction.

Cheers,
Sagar

On Thu, Apr 25, 2013 at 1:44 PM, Vinod Kumar Vavilapalli <
vino...@hortonworks.com> wrote:

> The 'standard' way to do this is using queu-acls to enforce a particular
> user to be able to submit jobs to a sub-set of queues and then let the user
> decide which of that subset of queues he wishes to submit a job to.
>
> Thanks,
> +Vinod Kumar Vavilapalli
> Hortonworks Inc.
> http://hortonworks.com/
>
> On Apr 24, 2013, at 6:22 PM, Sagar Mehta wrote:
>
> Hi Guys,
>
> We have a general purpose Hive cluster [about 200 nodes] which is used for
> various jobs like
>
>- Production
>- Experimental/Research
>- Adhoc queries
>
> We are using the fair-share scheduler to schedule them and for this we
> have corresponding 3 pools in the scheduler.
>
> *Here is what we want.*
>
> *A hive query submitted by a user with user-name A should go to one of
> the pools above based on a pre-defined mapping. We are wondering where/how
> to specify this mapping?*
>
> *We can do this manually by adding -Dmapred.job.queue.name="X" on a
> particular job run.*
>
> This puts the job on the map-reduce queue named "X" and the following
> configuration in the fair-share scheduler
>
>   <property>
>     <name>mapred.fairscheduler.poolnameproperty</name>
>     <value>mapred.job.queue.name</value>
>   </property>
>
> maps this to a pool named "X" in the fair-share scheduler.
>
> However we [while wearing our Hadoop developer/admin hat] don't want the
> user/analyst to specify that so as to enforce some cluster-use policy.
>
> Based on his/her username we want to automatically select which hadoop
> queue and subsequently which fair-share scheduler pool, his/her job should
> go to. I'm pretty sure this is a common use-case and wondering how to do
> this in Hadoop.
>
> Any help/insights/pointers would be greatly appreciated.
>
> Sagar
> PS - Btw we are using Cloudera cdh3u2 and the user jobs are Hive queries.
>
>
>
>
>


Re: Automatically mapping a job submitted by a particular user to a specific hadoop map-reduce queue

2013-04-26 Thread Sagar Mehta
Hi Nitin,

Thanks for your reply.

Yes this is exactly what we are doing by asking the user to modify the
,hiverc and then using ACLs [white-lists] by configuring
mapred-queue-acls.xml to ensure people don't submit to wrong queues. [or
are not allowed to]

As I said in one of the other threads, besides being a manual approach, it
also exposes the policy where user A is asked to modify his/her .hiverc to
submit jobs to queue X and user B is asked to modify his/her .hiverc to
submit jobs to queue Y potentially with different scheduling properties. We
want this to be more or less transparent to the user.

We have a decent sized cluster [200 nodes] with more than 30+ different
users.

I think the JIRA that Sandy pointed out below is a good first step in that
direction.

Sagar

On Thu, Apr 25, 2013 at 3:04 AM, Nitin Pawar wrote:

> the current capacity scheduler guarantees that which users can submit jobs
> to which queue and other related features.
> More of which you can read at
> http://hadoop.apache.org/docs/stable/capacity_scheduler.html
>
> but on the hive side, unless you set mapred.job.queue.name on the hive
> cli, they will be submitted to default job queue.
>
> So basically what you would like to do is create a user, associate it with a
> queue in the scheduler, and ask the user to set its queue in the local hiverc
> file.
>
> I am not sure if this can be part of hive's metastore. Because one user
> can be allowed to submit the job to multiple queues and then best way to
> handle it is via setting the property each time you open the session or via
> hiverc file
>
>
> On Thu, Apr 25, 2013 at 12:11 PM, Sandy Ryza wrote:
>
>> Hi Sagar,
>>
>> This capability currently does not exist in the fair scheduler (or other
>> schedulers, as far as I know), but a JIRA has been filed recently that
>> addresses a similar need.   Would
>> https://issues.apache.org/jira/browse/MAPREDUCE-5132 work for what
>> you're trying to do?  If not, would you mind filing a new JIRA for the
>> functionality you'd want?
>>
>> -Sandy
>>
>>
>> On Wed, Apr 24, 2013 at 6:22 PM, Sagar Mehta wrote:
>>
>>> Hi Guys,
>>>
>>> We have a general purpose Hive cluster [about 200 nodes] which is used
>>> for various jobs like
>>>
>>>- Production
>>>- Experimental/Research
>>>- Adhoc queries
>>>
>>> We are using the fair-share scheduler to schedule them and for this we
>>> have corresponding 3 pools in the scheduler.
>>>
>>> *Here is what we want.*
>>>
>>> *A hive query submitted by a user with user-name A should go to one of
>>> the pools above based on a pre-defined mapping. We are wondering where/how
>>> to specify this mapping?*
>>>
>>> *We can do this manually by adding -Dmapred.job.queue.name="X" on a
>>> particular job run.*
>>>
>>> This puts the job on the map-reduce queue named "X" and the following
>>> configuration in the fair-share scheduler
>>>
>>>   <property>
>>>     <name>mapred.fairscheduler.poolnameproperty</name>
>>>     <value>mapred.job.queue.name</value>
>>>   </property>
>>>
>>> maps this to a pool named "X" in the fair-share scheduler.
>>>
>>> However we [while wearing our Hadoop developer/admin hat] don't want the
>>> user/analyst to specify that so as to enforce some cluster-use policy.
>>>
>>> Based on his/her username we want to automatically select which hadoop
>>> queue and subsequently which fair-share scheduler pool, his/her job should
>>> go to. I'm pretty sure this is a common use-case and wondering how to do
>>> this in Hadoop.
>>>
>>> Any help/insights/pointers would be greatly appreciated.
>>>
>>> Sagar
>>> PS - Btw we are using Cloudera cdh3u2 and the user jobs are Hive queries.
>>>
>>>
>>>
>>>
>>
>
>
> --
> Nitin Pawar
>


Re: Automatically mapping a job submitted by a particular user to a specific hadoop map-reduce queue

2013-04-26 Thread Sagar Mehta
Hi Sandy,

Thanks for your prompt reply!!

The jira that you pointed out would make it easy for us to do the automatic
mapping and get closer to enforcing a policy automatically. Any
idea when it would be incorporated into cdh/hadoop releases and if it could
be back-ported for cdh3u2 which we have currently running in production?

Currently we are getting around this using the -Dmapred.job.queue.name="X"
and the subsequent mapping of map-red job queue to Fair-share scheduler
pool. We are using ACLs [more of a white-list] by
configuring  mapred-queue-acls.xml to ensure people can only submit to the
right queue.
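
For illustration, the programmatic equivalent of that -D flag, if you wanted to
hard-code the routing inside a driver rather than rely on each user passing it
(a sketch only; "production" and MyDriver are placeholder names):

JobConf conf = new JobConf(MyDriver.class);
// Same effect as -Dmapred.job.queue.name=production on the command line;
// the fair scheduler then resolves the pool through
// mapred.fairscheduler.poolnameproperty as described above.
conf.set("mapred.job.queue.name", "production");
// JobConf also has a convenience setter for this key:
// conf.setQueueName("production");

Of course this only moves the policy from the user's command line into the job
code; it still does not give the automatic per-user mapping we are after.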

*Two limitations of this round-about approach are*

   1. It is manual
   2. It exposes the policy where user A is asked to submit jobs to queue X
   and user B is asked to submit jobs to queue Y [with different scheduler
   properties]. We want this to be completely transparent to the user of our
   cluster.

The jira above would be a great first step towards such automatic mapping!!

Cheers,
Sagar

On Wed, Apr 24, 2013 at 11:41 PM, Sandy Ryza wrote:

> Hi Sagar,
>
> This capability currently does not exist in the fair scheduler (or other
> schedulers, as far as I know), but a JIRA has been filed recently that
> addresses a similar need.   Would
> https://issues.apache.org/jira/browse/MAPREDUCE-5132 work for what you're
> trying to do?  If not, would you mind filing a new JIRA for the
> functionality you'd want?
>
> -Sandy
>
>
> On Wed, Apr 24, 2013 at 6:22 PM, Sagar Mehta  wrote:
>
>> Hi Guys,
>>
>> We have a general purpose Hive cluster [about 200 nodes] which is used
>> for various jobs like
>>
>>- Production
>>- Experimental/Research
>>- Adhoc queries
>>
>> We are using the fair-share scheduler to schedule them and for this we
>> have corresponding 3 pools in the scheduler.
>>
>> *Here is what we want.*
>>
>> *A hive query submitted by a user with user-name A should go to one of
>> the pools above based on a pre-defined mapping. We are wondering where/how
>> to specify this mapping?*
>>
>> *We can do this manually by adding -Dmapred.job.queue.name="X" on a
>> particular job run.*
>>
>> This puts the job on the map-reduce queue named "X" and the following
>> configuration in the fair-share scheduler
>>
>>   <property>
>>     <name>mapred.fairscheduler.poolnameproperty</name>
>>     <value>mapred.job.queue.name</value>
>>   </property>
>>
>> maps this to a pool named "X" in the fair-share scheduler.
>>
>> However we [while wearing our Hadoop developer/admin hat] don't want the
>> user/analyst to specify that so as to enforce some cluster-use policy.
>>
>> Based on his/her username we want to automatically select which hadoop
>> queue and subsequently which fair-share scheduler pool, his/her job should
>> go to. I'm pretty sure this is a common use-case and wondering how to do
>> this in Hadoop.
>>
>> Any help/insights/pointers would be greatly appreciated.
>>
>> Sagar
>> PS - Btw we are using Cloudera cdh3u2 and the user jobs are Hive queries.
>>
>>
>>
>>
>


Re: M/R Staticstics

2013-04-26 Thread rkevinburton


As an addendum I looked to see what was installed with  apt-cache and 
got the following output:


kevin@devUbuntu05:~$ apt-cache search hadoop
python-mrjob - MapReduce framework for writing and running Hadoop 
Streaming jobs
ubuntu-orchestra-modules-hadoop - Modules mainly used by 
orchestra-management-server
flume-ng - reliable, scalable, and manageable distributed data 
collection application

hadoop - A software platform for processing vast amounts of data
hadoop-0.20-conf-pseudo - Hadoop installation in pseudo-distributed mode 
with MRv1
hadoop-0.20-mapreduce - A software platform for processing vast amounts 
of data

hadoop-0.20-mapreduce-jobtracker - JobTracker for Hadoop
hadoop-0.20-mapreduce-jobtrackerha - High Availability JobTracker for 
Hadoop

hadoop-0.20-mapreduce-tasktracker - Task Tracker for Hadoop
hadoop-0.20-mapreduce-zkfc - Hadoop MapReduce failover controller
hadoop-client - Hadoop client side dependencies
hadoop-conf-pseudo - Pseudo-distributed Hadoop configuration
hadoop-doc - Documentation for Hadoop
hadoop-hdfs - The Hadoop Distributed File System
hadoop-hdfs-datanode - Data Node for Hadoop
hadoop-hdfs-fuse - HDFS exposed over a Filesystem in Userspace
hadoop-hdfs-journalnode - Hadoop HDFS JournalNode
hadoop-hdfs-namenode - Name Node for Hadoop
hadoop-hdfs-secondarynamenode - Secondary Name Node for Hadoop
hadoop-hdfs-zkfc - Hadoop HDFS failover controller
hadoop-httpfs - HTTPFS for Hadoop
hadoop-mapreduce - The Hadoop MapReduce (MRv2)
hadoop-mapreduce-historyserver - MapReduce History Server
hadoop-yarn - The Hadoop NextGen MapReduce (YARN)
hadoop-yarn-nodemanager - Node manager for Hadoop
hadoop-yarn-proxyserver - Web proxy for YARN
hadoop-yarn-resourcemanager - Resource manager for Hadoop
hbase - HBase is the Hadoop database
hcatalog - Apache HCatalog is a table and storage management service.
hive - A data warehouse infrastructure built on top of Hadoop
hue-common - A browser-based desktop interface for Hadoop
hue-filebrowser - A UI for the Hadoop Distributed File System (HDFS)
hue-jobbrowser - A UI for viewing Hadoop map-reduce jobs
hue-jobsub - A UI for designing and submitting map-reduce jobs to Hadoop
hue-plugins - Plug-ins for Hadoop to enable integration with Hue
hue-shell - A shell for console based Hadoop applications
libhdfs0 - JNI Bindings to access Hadoop HDFS from C
mahout - A set of Java libraries for scalable machine learning.
oozie - A workflow and coordinator sytem for Hadoop jobs.
pig - A platform for analyzing large data sets using Hadoop
pig-udf-datafu - A collection of user-defined functions for Hadoop and 
Pig.
sqoop - Tool for easy imports and exports of data sets between databases 
and HDFS
sqoop2 - Tool for easy imports and exports of data sets between 
databases and HDFS
webhcat - WEBHcat provides a REST-like web API for HCatalog and related 
Hadoop components.

cdh4-repository - Cloudera's Distribution including Apache Hadoop

So it seems that MapReduce is installed but I don't see anything in 
/etc/init.d to start it up. Ideas?


On Fri, Apr 26, 2013 at 10:37 AM, Rishi Yadav wrote:

 do you see "retired jobs" on job tracker page. There is also "job 
tracker history" on the bottom of page. 


something like this  http://nn.zettabyte.com:50030/jobtracker.jsp 


Thanks and Regards,
Rishi Yadav



On Fri, Apr 26, 2013 at 7:36 AM, < rkevinbur...@charter.net 
 

wrote:
When I submit a simple "Hello World" M/R job like WordCount it takes 
less than 5 seconds. The texts show numerous methods for monitoring M/R 
jobs as they are happening but I have yet to see any that show 
statistics about a job after it has completed. Obviously simple jobs 
that take a short amount of time don't allow time to fire up any web 
page or monitoring tool to see how it progresses through the JobTracker 
and TaskTracker as well as which node it is processed on. Any 
suggestions on how I could see this kind of data *after* a job has 
completed?


Re: M/R Staticstics

2013-04-26 Thread rkevinburton


I get a message like:

Oops! Google Chrome could not connect to devubuntu05:50030

Where devubuntu05 is the machine (JobTracker, NameNode) running hadoop.

I know it is running because when I do ps I get something like:

kevin@devUbuntu05:~$ ps aux | grep hadoop

hdfs  1095  0.0  2.7 1983656 113048 ?  Sl   Apr23   3:18 
/usr/lib/jvm/java-7-openjdk-amd64//bin/java -Dproc_datanode -Xmx1000m 
-Dhadoop.log.dir=/var/log/hadoop-hdfs 
-Dhadoop.log.file=hadoop-hdfs-datanode-devUbuntu05.log 
-Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str=hdfs 
-Dhadoop.root.logger=INFO,RFA 
-Djava.library.path=/usr/lib/hadoop/lib/native 
-Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true 
-server -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote -Dhadoop.security.logger=INFO,RFAS 
org.apache.hadoop.hdfs.server.datanode.DataNode


hdfs  1415  0.1  3.1 1972424 126388 ?  Sl   Apr23   5:52 
/usr/lib/jvm/java-7-openjdk-amd64//bin/java -Dproc_namenode -Xmx1000m 
-Dhadoop.log.dir=/var/log/hadoop-hdfs 
-Dhadoop.log.file=hadoop-hdfs-namenode-devUbuntu05.log 
-Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str=hdfs 
-Dhadoop.root.logger=INFO,RFA 
-Djava.library.path=/usr/lib/hadoop/lib/native 
-Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true 
-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote -Dhadoop.security.logger=INFO,RFAS 
org.apache.hadoop.hdfs.server.namenode.NameNode


hdfs  1529  0.0  2.7 1961128 111680 ?  Sl   Apr23   1:49 
/usr/lib/jvm/java-7-openjdk-amd64//bin/java -Dproc_secondarynamenode 
-Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-hdfs 
-Dhadoop.log.file=hadoop-hdfs-secondarynamenode-devUbuntu05.log 
-Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str=hdfs 
-Dhadoop.root.logger=INFO,RFA 
-Djava.library.path=/usr/lib/hadoop/lib/native 
-Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true 
-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote -Dhadoop.security.logger=INFO,RFAS 
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode


yarn  1701  0.0  5.3 1998140 216820 ?  Sl   Apr23   2:36 
/usr/lib/jvm/java-7-openjdk-amd64//bin/java -Dproc_nodemanager -Xmx1000m 
-server -Dhadoop.log.dir=/var/log/hadoop-yarn 
-Dyarn.log.dir=/var/log/hadoop-yarn 
-Dhadoop.log.file=yarn-yarn-nodemanager-devUbuntu05.log 
-Dyarn.log.file=yarn-yarn-nodemanager-devUbuntu05.log 
-Dyarn.home.dir=/usr/lib/hadoop-yarn -Dhadoop.home.dir=/usr/lib/hadoop-yarn 
-Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA 
-Djava.library.path=/usr/lib/hadoop/lib/native -classpath 
/etc/hadoop/conf:/etc/hadoop/conf:/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-yarn/lib/*:/etc/hadoop/conf/nm-config/log4j.properties 
org.apache.hadoop.yarn.server.nodemanager.NodeManager


yarn  1828  0.0  5.3 2128812 217740 ?  Sl   Apr23   2:38 
/usr/lib/jvm/java-7-openjdk-amd64//bin/java -Dproc_resourcemanager 
-Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-yarn 
-Dyarn.log.dir=/var/log/hadoop-yarn 
-Dhadoop.log.file=yarn-yarn-resourcemanager-devUbuntu05.log 
-Dyarn.log.file=yarn-yarn-resourcemanager-devUbuntu05.log 
-Dyarn.home.dir=/usr/lib/hadoop-yarn 
-Dhadoop.home.dir=/usr/lib/hadoop-yarn -Dhadoop.root.logger=INFO,RFA 
-Dyarn.root.logger=INFO,RFA 
-Djava.library.path=/usr/lib/hadoop/lib/native -classpath 
/etc/hadoop/conf:/etc/hadoop/conf:/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-yarn/lib/*:/etc/hadoop/conf/rm-config/log4j.properties 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager


So why isn't the JobTracker running?

On Fri, Apr 26, 2013 at 10:37 AM, Rishi Yadav wrote:

 do you see "retired jobs" on job tracker page. There is also "job 
tracker history" on the bottom of page. 


something like this  http://nn.zettabyte.com:50030/jobtracker.jsp 


Thanks and Regards,
Rishi Yadav



On Fri, Apr 26, 2013 at 7:36 AM, < rkevinbur...@charter.net 
 

wrote:
When I submit a simple "Hello World" M/R job like WordCount it takes 
less than 5 seconds. The texts show numerous methods for monitoring M/R 
jobs as they are happening but I have yet to see any that show 
statistics about a job after it has completed. Obviously simple jobs 
that take a short amount of time don't allow time to fire up any web 
page or monitoring tool to see how it progresses through the JobTracker 
and TaskTracker as well as which node it

Re: M/R Staticstics

2013-04-26 Thread Rishi Yadav
do you see "retired jobs" on job tracker page. There is also "job tracker
history" on the bottom of page.

something like this http://nn.zettabyte.com:50030/jobtracker.jsp

Thanks and Regards,

Rishi Yadav




On Fri, Apr 26, 2013 at 7:36 AM,  wrote:

> When I submit a simple "Hello World" M/R job like WordCount it takes less
> than 5 seconds. The texts show numerous methods for monitoring M/R jobs as
> they are happening but I have yet to see any that show statistics about a
> job after it has completed. Obviously simple jobs that take a short amount
> of time don't allow time to fire up any web page or monitoring tool to see
> how it progresses through the JobTracker and TaskTracker as well as which
> node it is processed on. Any suggestions on how I could see this kind of data
> *after* a job has completed?
>


M/R Staticstics

2013-04-26 Thread rkevinburton


When I submit a simple "Hello World" M/R job like WordCount it takes 
less than 5 seconds. The texts show numerous methods for monitoring M/R 
jobs as they are happening but I have yet to see any that show 
statistics about a job after it has completed. Obviously simple jobs 
that take a short amount of time don't allow time to fire up any web 
page or monitoring tool to see how it progresses through the JobTracker 
and TaskTracker as well as which node it is processed on. Any 
suggestions on how I could see this kind of data *after* a job has 
completed?


Failed to install openssl-devel 1.0.0-20.el6 on OS RHELS 6.3 x86_64

2013-04-26 Thread sam liu
Hi,

For building Hadoop on OS RHELS 6.3 x86_64, I tried to install
openssl-devel, but failed. The exception is as below. The required version
of glibc-common is 2.12-1.47.el6, but my installed one is 2.12-1.80.el6,
which is newer. Why does it fail? How can I resolve this issue?

---> Package nss-softokn-freebl.i686 0:3.12.9-11.el6 will be installed
--> Finished Dependency Resolution
Error: Package: glibc-2.12-1.47.el6.i686 (rhel-cd)
   Requires: glibc-common = 2.12-1.47.el6
   Installed: glibc-common-2.12-1.80.el6.x86_64
(@anaconda-RedHatEnterpriseLinux-201206132210.x86_64/6.3)
   glibc-common = 2.12-1.80.el6
   Available: glibc-common-2.12-1.47.el6.x86_64 (rhel-cd)
   glibc-common = 2.12-1.47.el6
 You could try using --skip-broken to work around the problem
 You could try running: rpm -Va --nofiles --nodigest


Sam Liu

Thanks!


Re: M/R job optimization

2013-04-26 Thread Ted Xu
Hi Han,

It may be caused by skewed partitioning, which means some specific reducers
are assigned much more data than average, causing a long tail. To verify that,
you can check the task counters and see if the partitioning is balanced enough.

Some tools implemented specific algorithms to handle this issue, for
example pig skewed join (http://wiki.apache.org/pig/PigSkewedJoinSpec)
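
For illustration, a rough sketch of pulling those per-reduce counters from the
old-API client on an MRv1 cluster (the job id below is a placeholder, and the
counter group/name are the MRv1 ones mentioned in the deprecation warnings
elsewhere in this digest; adjust them for your version):

import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.TaskReport;

public class ReduceSkewCheck {
  public static void main(String[] args) throws Exception {
    JobClient jc = new JobClient(new JobConf());
    JobID id = JobID.forName("job_201304260000_0001");  // placeholder job id
    // Print reduce input records per reduce task; one task far above the
    // others points at a skewed partition rather than a slow node.
    for (TaskReport report : jc.getReduceTaskReports(id)) {
      Counters.Counter c = report.getCounters().findCounter(
          "org.apache.hadoop.mapred.Task$Counter", "REDUCE_INPUT_RECORDS");
      System.out.println(report.getTaskID() + "\t" + (c == null ? 0 : c.getValue()));
    }
  }
}

The same numbers are also visible per task in the JobTracker web UI, which may
be quicker than writing code for a one-off check.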


On Fri, Apr 26, 2013 at 5:21 PM, Han JU  wrote:

> Hi,
>
> I've implemented an algorithm with Hadoop, it's a series of 4 jobs. My
> question is that in one of the jobs, map and reduce tasks show 100% finished
> in about 1m 30s, but I have to wait another 5m for this job to finish.
> This job writes about 720mb compressed data to HDFS with replication
> factor 1, in sequence file format. I've tried copying these data to hdfs,
> it takes only < 20 seconds. What happened during this 5 more minutes?
>
> Any idea on how to optimize this part?
>
> Thanks.
>
> --
> *JU Han*
>
> UTC   -  Université de Technologie de Compiègne
> * **GI06 - Fouille de Données et Décisionnel*
>
> +33 061960
>


Regards,

Ted Xu


M/R job optimization

2013-04-26 Thread Han JU
Hi,

I've implemented an algorithm with Hadoop, it's a series of 4 jobs. My
question is that in one of the jobs, map and reduce tasks show 100% finished
in about 1m 30s, but I have to wait another 5m for this job to finish.
This job writes about 720mb compressed data to HDFS with replication factor
1, in sequence file format. I've tried copying these data to hdfs, it takes
only < 20 seconds. What happened during this 5 more minutes?

Any idea on how to optimize this part?

Thanks.

-- 
*JU Han*

UTC   -  Université de Technologie de Compiègne
* **GI06 - Fouille de Données et Décisionnel*

+33 061960