RE: Automate Hadoop installation

2011-12-05 Thread Sagar Shukla
Hi Praveenesh,
  I had created VM images with the OS and Hadoop pre-configured, which I 
would start as needed. But if you plan to do this at the hardware level, 
Linux provides Kickstart-style provisioning, which performs OS and package 
installation automatically (network configuration is handled through DHCP). 
This requires TFTP and DHCP servers on the network and hardware that supports 
network (PXE) boot.

Something like Puppet or shell scripts can also be used, as you mentioned; I 
have used those, but not for Hadoop.

Thanks,
Sagar

-Original Message-
From: praveenesh kumar [mailto:praveen...@gmail.com]
Sent: Monday, December 05, 2011 4:02 PM
To: common-user@hadoop.apache.org
Subject: Automate Hadoop installation

Hi all,

Can anyone guide me how to automate the hadoop installation/configuration 
process?
I want to install hadoop on 10-20 nodes which may even exceed to 50-100 nodes ?
I know we can use some configuration tools like puppet/or shell-scripts ?
Has anyone done it ?

How can we do hadoop installations on so many machines parallely ? What are the 
best practices for this ?

Thanks,
Praveenesh



Re: Help with Hadoop Eclipse Plugin on Mac OS X Lion

2011-12-05 Thread Jignesh Patel
I am running the 64-bit version. Have you set up SSH properly?
On Dec 3, 2011, at 2:30 AM, Will L wrote:

 
 
 I am using 64-Bit Eclipse 3.7.1 Cocoa with Hadoop 0.20.205.0. I get the 
 following error message:
 An internal error occurred during: Connecting to DFS localhost.
 org/apache/commons/configuration/Configuration 
 
 From: seventeen_reas...@hotmail.com
 To: common-user@hadoop.apache.org
 Subject: RE: Help with Hadoop Eclipse Plugin on Mac OS X Lion
 Date: Fri, 2 Dec 2011 20:51:02 -0800
 
 
 What version of Hadoop are you running on OS X Lion and are you running 
 32-bit or 64-bit version of Eclipse?
 
 Subject: Re: Help with Hadoop Eclipse Plugin on Mac OS X Lion
 From: jign...@websoft.com
 Date: Fri, 2 Dec 2011 14:37:28 -0500
 To: common-user@hadoop.apache.org
 
 I am running eclipse plugin in Lion OS X on eclipse 3.7.
 
 Take the plugin from the contrib folder of the Hadoop distribution and drop it 
 into your Eclipse plugins directory. If that doesn't work, remove Eclipse and 
 reinstall a fresh version.
 
 -Jignesh
 
 On Dec 2, 2011, at 11:59 AM, Prashant Sharma wrote:
 
 Nice to know, Will. As I said, you have the same luxury as long as
 you are running in stand-alone mode, which is ideal for development.
 
 On Fri, Dec 2, 2011 at 10:02 PM, Will L 
 seventeen_reas...@hotmail.comwrote:
 
 
 
 I got the setup working under my laptop running OS X Snow Leopard without
 any problems and I would like to use my new laptop running OS X Lion.
 
 The plugin is helpful in that I can see hadoop output being dumped to the
 eclipse console and it used to integrate well with the Eclipse IDE making 
 my
 development life a little easier.
 
 Thank you for your time and help.
 
 Sincerely,
 
 Will Lieu
 
 Date: Fri, 2 Dec 2011 21:44:36 +0530
 Subject: Re: Help with Hadoop Eclipse Plugin on Mac OS X Lion
 From: prashant.ii...@gmail.com
 To: common-user@hadoop.apache.org
 
 Why do you need a plugin at all?
 
 You can do away with it by having a Maven project, i.e. a pom.xml with
 Hadoop set as one of the dependencies. Then use the regular Maven commands
 to build etc.; e.g. mvn eclipse:eclipse would be an interesting command.
 
 On Fri, Dec 2, 2011 at 1:59 PM, Will L seventeen_reas...@hotmail.com
 wrote:
 
 
 
 Oops guess the formatting went away:
 I have tried the following combinations:
 * Hadoop 0.20.203, Eclipse 3.6.2 (32-bit),
 hadoop-eclipse-plugin-0.20.203.0.jar
 * Hadoop 0.20.203, Eclipse 3.6.2 (32-bit),
 hadoop-eclipse-plugin-0.20.3-SNAPSHOT.jar (from JIRA)
 * Hadoop 0.20.203 Eclipse 3.7.1 (32-bit),
 hadoop-eclipse-plugin-0.20.203.0.jar
 * Hadoop 0.20.203, Eclipse 3.7.1 (32-bit),
 hadoop-eclipse-plugin-0.20.3-SNAPSHOT.jar (from JIRA)
 * Hadoop 0.20.205, Eclipse 3.7.1 (32-bit),
 hadoop-eclipse-plugin-0.20.205.0.jar
 
 From: seventeen_reas...@hotmail.com
 To: common-user@hadoop.apache.org
 Subject: Help with Hadoop Eclipse Plugin on Mac OS X Lion
 Date: Fri, 2 Dec 2011 00:26:28 -0800
 
 
 
 
 
 
 Hello,
 I am having problems getting my hadoop eclipse plugin to work on Mac
 OS
 X Lion.
 
 I have tried the following combinations:
 * Hadoop 0.20.203, Eclipse 3.6.2 (32-bit), hadoop-eclipse-plugin-0.20.203.0.jar
 * Hadoop 0.20.203, Eclipse 3.6.2 (32-bit), hadoop-eclipse-plugin-0.20.3-SNAPSHOT.jar (from JIRA)
 * Hadoop 0.20.203, Eclipse 3.7.1 (32-bit), hadoop-eclipse-plugin-0.20.203.0.jar
 * Hadoop 0.20.203, Eclipse 3.7.1 (32-bit), hadoop-eclipse-plugin-0.20.3-SNAPSHOT.jar (from JIRA)
 * Hadoop 0.20.205, Eclipse 3.7.1 (32-bit), hadoop-eclipse-plugin-0.20.205.0.jar
 
 Has anyone gotten the hadoop eclipse plugin to work on Mac OS X Lion?
 
 
 Thank you for your time and help; I greatly appreciate it!
 
 
 Sincerely,
 
 
 Will
 
 
 
 
 
 
 

 



Multiple Mappers for Multiple Tables

2011-12-05 Thread Justin Vincent
I would like to join some DB tables, possibly from different databases, in an
MR job.

I would essentially like to use MultipleInputs, but that seems file
oriented. I need a different mapper for each DB table.

Suggestions?

Thanks!

Justin Vincent


Re: Automate Hadoop installation

2011-12-05 Thread Konstantin Boudnik
There is this great project called BigTop (in the Apache Incubator) which
provides for building the Hadoop stack.

Part of what it provides is a set of Puppet recipes which will let you
do exactly what you're looking for, perhaps with some minor adjustments.

Seriously, look at Puppet - otherwise it will be a living nightmare of
configuration mismanagement.

Cos

On Mon, Dec 05, 2011 at 04:02PM, praveenesh kumar wrote:
 Hi all,
 
 Can anyone guide me how to automate the hadoop installation/configuration
 process?
 I want to install hadoop on 10-20 nodes which may even exceed to 50-100
 nodes ?
 I know we can use some configuration tools like puppet/or shell-scripts ?
 Has anyone done it ?
 
 How can we do hadoop installations on so many machines parallely ? What are
 the best practices for this ?
 
 Thanks,
 Praveenesh


Hadoop Profiling

2011-12-05 Thread Bai Shen
I turned on the profiling in Hadoop, and the MapReduceTutorial at
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html says that
the profile files should go to the user log directory.  However, they're
currently going to the working directory where I start the hadoop job
from.  I've set $HADOOP_LOG_DIR but that hasn't made a difference.

What do I need to change or set in order for the profile files to go to the
correct log directory?

Thanks.
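
(For reference, here is a minimal sketch of enabling task profiling from a job
driver. The mapred.task.profile* property names are the standard 0.20-era knobs
documented in that tutorial; the hprof agent options and the task ranges chosen
below are only illustrative assumptions.)

// Minimal sketch: enabling task profiling from a job driver (0.20-era API).
// The hprof options and the "0-2" task ranges are illustrative assumptions.
import org.apache.hadoop.mapred.JobConf;

public class ProfiledJobDriver {
  public static JobConf enableProfiling(JobConf conf) {
    conf.setBoolean("mapred.task.profile", true);    // turn task profiling on
    conf.set("mapred.task.profile.maps", "0-2");     // profile only the first three map tasks
    conf.set("mapred.task.profile.reduces", "0-2");  // and the first three reduce tasks
    conf.set("mapred.task.profile.params",
        "-agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y,verbose=n,file=%s");
    return conf;
  }
}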


Re: Multiple Mappers for Multiple Tables

2011-12-05 Thread Bejoy Ks
Justin
        If I understand your requirement right, you need to pull in data from
multiple RDBMS sources, join it, and possibly run some more custom operations
on top of that. For this you don't need to write your own MapReduce code
unless it is really required. You can achieve the same in two easy steps:
- Import the data from the RDBMS into Hive using Sqoop (import)
- Use Hive to do the joins and processing on this data

Hope it helps!..

Regards
Bejoy.K.S

On Tue, Dec 6, 2011 at 12:13 AM, Justin Vincent justi...@gmail.com wrote:

 I would like join some db tables, possibly from different databases, in a
 MR job.

 I would essentially like to use MultipleInputs, but that seems file
 oriented. I need a different mapper for each db table.

 Suggestions?

 Thanks!

 Justin Vincent



Pig Output

2011-12-05 Thread Aaron Griffith
Using PigStorage() my Pig script output gets put into partial files on the
Hadoop file system.

When I use the copyToLocal function from Hadoop it creates a local directory
with all the partial files.

Is there a way to copy the partial files from Hadoop into a single local file?

Thanks



Re: Multiple Mappers for Multiple Tables

2011-12-05 Thread Bejoy Ks
Hi Justin,
        Just to add to my earlier response: if you need to fetch data from an
RDBMS in your mapper using custom MapReduce code, you can use DBInputFormat
for your mapper class together with MultipleInputs. You have to be careful
with the number of mappers in your application, as databases are constrained
by a limit on the maximum number of simultaneous connections. You also need
to ensure that the same query is not executed n times in n mappers, all
fetching the same data; that would just be a waste of network bandwidth. Sqoop
+ Hive would be my recommendation and a good combination for such use
cases. If you have Pig competency you can also look into Pig instead of
Hive.
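
(For illustration, a minimal sketch of wiring up DBInputFormat with the old
org.apache.hadoop.mapred API follows. DBConfiguration.configureDB and
DBInputFormat.setInput are the real entry points; the JDBC driver/URL, the
employees table, its columns, and the EmployeeRecord class are hypothetical
placeholders.)

// Minimal sketch (old org.apache.hadoop.mapred API): reading one DB table with
// DBInputFormat. Driver, URL, table and column names are placeholders.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;

public class DbInputExample {

  // Record type mapping one row of the (hypothetical) employees table.
  public static class EmployeeRecord implements Writable, DBWritable {
    long id;
    String name;

    public void readFields(ResultSet rs) throws SQLException {
      id = rs.getLong("id");
      name = rs.getString("name");
    }
    public void write(PreparedStatement ps) throws SQLException {
      ps.setLong(1, id);
      ps.setString(2, name);
    }
    public void readFields(DataInput in) throws IOException {
      id = in.readLong();
      name = in.readUTF();
    }
    public void write(DataOutput out) throws IOException {
      out.writeLong(id);
      out.writeUTF(name);
    }
  }

  public static void configure(JobConf job) {
    job.setInputFormat(DBInputFormat.class);
    // JDBC driver, connection URL and credentials (placeholders).
    DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost/mydb", "user", "password");
    // Table name, optional WHERE clause, ORDER BY column, then the columns to read.
    DBInputFormat.setInput(job, EmployeeRecord.class,
        "employees", null, "id", "id", "name");
  }
}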

Hope it helps!...

Regards
Bejoy.K.S

On Tue, Dec 6, 2011 at 1:36 AM, Bejoy Ks bejoy.had...@gmail.com wrote:

 Justin
 If I get your requirement right you need to get in data from
 multiple rdbms sources and do a join on the same, also may be some more
 custom operations on top of this. For this you don't need to go in for
 writing your custom mapreduce code unless it is that required. You can
 achieve the same in two easy steps
 - Import data from RDBMS into Hive using SQOOP (Import)
 - Use hive to do some join and processing on this data

 Hope it helps!..

 Regards
 Bejoy.K.S


 On Tue, Dec 6, 2011 at 12:13 AM, Justin Vincent justi...@gmail.comwrote:

 I would like join some db tables, possibly from different databases, in a
 MR job.

 I would essentially like to use MultipleInputs, but that seems file
 oriented. I need a different mapper for each db table.

 Suggestions?

 Thanks!

 Justin Vincent





Re: Pig Output

2011-12-05 Thread Bejoy Ks
Hi Aaron
         Instead of copyToLocal, use getmerge; it will do the job. The
syntax for the CLI is:
hadoop fs -getmerge <pig output dir in hdfs> <local destination dir>/xyz.txt


Hope it helps!...

Regards
Bejoy.K.S

On Tue, Dec 6, 2011 at 1:57 AM, Aaron Griffith
aaron.c.griff...@gmail.comwrote:

 Using PigStorage() my pig script output gets put into partial files on the
 hadoop
 file system.

 When I use the copyToLocal fuction from Hadoop it creates a local
 directory with
 all the partial files.

 Is there a way to copy the partial files from hadoop into a single local
 file?

 Thanks




MAX_FETCH_RETRIES_PER_MAP (TaskTracker dying?)

2011-12-05 Thread Chris Curtin
Hi,

Using: *Version:* 0.20.2-cdh3u0, r81256ad0f2e4ab2bd34b04f53d25a6c23686dd14,
8 node cluster, 64 bit Centos

We are occasionally seeing MAX_FETCH_RETRIES_PER_MAP errors on reducer
jobs. When we investigate it looks like the TaskTracker on the node being
fetched from is not running. Looking at the logs we see what looks like a
self-initiated shutdown:

2011-12-05 14:10:48,632 INFO org.apache.hadoop.mapred.JvmManager: JVM :
jvm_201112050908_0222_r_1100711673 exited with exit code 0. Number of tasks
it ran: 0
2011-12-05 14:10:48,632 ERROR org.apache.hadoop.mapred.JvmManager: Caught
Throwable in JVMRunner. Aborting TaskTracker.
java.lang.NullPointerException
at
org.apache.hadoop.mapred.DefaultTaskController.logShExecStatus(DefaultTaskController.java:145)
at
org.apache.hadoop.mapred.DefaultTaskController.launchTask(DefaultTaskController.java:129)
at
org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:472)
at
org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:446)
2011-12-05 14:10:48,634 INFO org.apache.hadoop.mapred.TaskTracker:
SHUTDOWN_MSG:
/
SHUTDOWN_MSG: Shutting down TaskTracker at had11.atlis1/10.120.41.118
/

Then the reducers have the following:


2011-12-05 14:12:00,962 WARN org.apache.hadoop.mapred.ReduceTask:
java.net.ConnectException: Connection refused
 at java.net.PlainSocketImpl.socketConnect(Native Method)
 at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
 at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
 at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
 at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
 at java.net.Socket.connect(Socket.java:529)
 at sun.net.NetworkClient.doConnect(NetworkClient.java:158)
 at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
 at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
 at sun.net.www.http.HttpClient.init(HttpClient.java:233)
 at sun.net.www.http.HttpClient.New(HttpClient.java:306)
 at sun.net.www.http.HttpClient.New(HttpClient.java:323)
 at
sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:970)
 at
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:911)
 at
sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:836)
 at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1525)
 at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1482)
 at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1390)
 at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1301)
 at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1233)

2011-12-05 14:12:00,962 INFO org.apache.hadoop.mapred.ReduceTask: Task
attempt_201112050908_0169_r_05_0: Failed fetch #2 from
attempt_201112050908_0169_m_02_0
2011-12-05 14:12:00,962 INFO org.apache.hadoop.mapred.ReduceTask: Failed to
fetch map-output from attempt_201112050908_0169_m_02_0 even after
MAX_FETCH_RETRIES_PER_MAP retries...  or it is a read error,  reporting to
the JobTracker
2011-12-05 14:12:00,962 FATAL org.apache.hadoop.mapred.ReduceTask: Shuffle
failed with too many fetch failures and insufficient progress!Killing task
attempt_201112050908_0169_r_05_0.
2011-12-05 14:12:00,966 WARN org.apache.hadoop.mapred.ReduceTask:
attempt_201112050908_0169_r_05_0 adding host had11.atlis1 to penalty
box, next contact in 8 seconds
2011-12-05 14:12:00,966 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201112050908_0169_r_05_0: Got 1 map-outputs from previous
failures
The job then fails.

Several questions:
1. What is causing the TaskTracker to fail/exit? This happens after running
hundreds to thousands of jobs, so it's not just at start-up.
2. Why isn't Hadoop detecting that the reducers need something from a dead
mapper and re-running the mapper, even if it means aborting the reducers?
3. Why isn't the DataNode being used to fetch the blocks? It is still up
and running when this happens, so shouldn't it know where the files are in
HDFS?

Thanks,

Chris


Running a job continuously

2011-12-05 Thread burakkk
Hi everyone,
I want to run an MR job continuously, because I have streaming data and I
try to analyze it all the time with my own algorithm. For example, say you want
to solve the wordcount problem - it's the simplest one :) If you have
multiple files and new files keep arriving, how do you handle it?
You could execute an MR job per file, but you would have to do it repeatedly. So
what do you think?

Thanks
Best regards...

-- 

BURAK ISIKLI | http://burakisikli.wordpress.com


Re: MAX_FETCH_RETRIES_PER_MAP (TaskTracker dying?)

2011-12-05 Thread Todd Lipcon
Hi Chris,

I'd suggest updating to a newer version of your hadoop distro - you're
hitting some bugs that were fixed last summer. In particular, you're
missing the amendment patch from MAPREDUCE-2373 as well as some
patches to MR which make the fetch retry behavior more aggressive.

-Todd

On Mon, Dec 5, 2011 at 12:45 PM, Chris Curtin curtin.ch...@gmail.com wrote:
 Hi,

 Using: *Version:* 0.20.2-cdh3u0, r81256ad0f2e4ab2bd34b04f53d25a6c23686dd14,
 8 node cluster, 64 bit Centos

 We are occasionally seeing MAX_FETCH_RETRIES_PER_MAP errors on reducer
 jobs. When we investigate it looks like the TaskTracker on the node being
 fetched from is not running. Looking at the logs we see what looks like a
 self-initiated shutdown:

 2011-12-05 14:10:48,632 INFO org.apache.hadoop.mapred.JvmManager: JVM :
 jvm_201112050908_0222_r_1100711673 exited with exit code 0. Number of tasks
 it ran: 0
 2011-12-05 14:10:48,632 ERROR org.apache.hadoop.mapred.JvmManager: Caught
 Throwable in JVMRunner. Aborting TaskTracker.
 java.lang.NullPointerException
        at
 org.apache.hadoop.mapred.DefaultTaskController.logShExecStatus(DefaultTaskController.java:145)
        at
 org.apache.hadoop.mapred.DefaultTaskController.launchTask(DefaultTaskController.java:129)
        at
 org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:472)
        at
 org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:446)
 2011-12-05 14:10:48,634 INFO org.apache.hadoop.mapred.TaskTracker:
 SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down TaskTracker at had11.atlis1/10.120.41.118
 /

 Then the reducers have the following:


 2011-12-05 14:12:00,962 WARN org.apache.hadoop.mapred.ReduceTask:
 java.net.ConnectException: Connection refused
  at java.net.PlainSocketImpl.socketConnect(Native Method)
  at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
  at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
  at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
  at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
  at java.net.Socket.connect(Socket.java:529)
  at sun.net.NetworkClient.doConnect(NetworkClient.java:158)
  at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
  at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
  at sun.net.www.http.HttpClient.init(HttpClient.java:233)
  at sun.net.www.http.HttpClient.New(HttpClient.java:306)
  at sun.net.www.http.HttpClient.New(HttpClient.java:323)
  at
 sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:970)
  at
 sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:911)
  at
 sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:836)
  at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1525)
  at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1482)
  at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1390)
  at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1301)
  at
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1233)

 2011-12-05 14:12:00,962 INFO org.apache.hadoop.mapred.ReduceTask: Task
 attempt_201112050908_0169_r_05_0: Failed fetch #2 from
 attempt_201112050908_0169_m_02_0
 2011-12-05 14:12:00,962 INFO org.apache.hadoop.mapred.ReduceTask: Failed to
 fetch map-output from attempt_201112050908_0169_m_02_0 even after
 MAX_FETCH_RETRIES_PER_MAP retries...  or it is a read error,  reporting to
 the JobTracker
 2011-12-05 14:12:00,962 FATAL org.apache.hadoop.mapred.ReduceTask: Shuffle
 failed with too many fetch failures and insufficient progress!Killing task
 attempt_201112050908_0169_r_05_0.
 2011-12-05 14:12:00,966 WARN org.apache.hadoop.mapred.ReduceTask:
 attempt_201112050908_0169_r_05_0 adding host had11.atlis1 to penalty
 box, next contact in 8 seconds
 2011-12-05 14:12:00,966 INFO org.apache.hadoop.mapred.ReduceTask:
 attempt_201112050908_0169_r_05_0: Got 1 map-outputs from previous
 failures
 The job then fails.

 Several questions:
 1. what is causing the TaskTracker to fail/exit? This is after running
 hundreds to thousands of jobs, so it's not just at start-up.
 2. why isn't hadoop detecting that the reducers need something from a dead
 mapper and restarting the mapper job, even it means aborting the reducers?
 3. why isn't the DataNode being used to fetch the blocks? It is still up
 and running when this happens, so shouldn't it know where the files are in
 HDFS?

 Thanks,

 Chris



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: MAX_FETCH_RETRIES_PER_MAP (TaskTracker dying?)

2011-12-05 Thread Bejoy Ks
Hi Chris
         From the stack trace, it looks like a JVM corruption issue. It is
a known issue and has been fixed in CDH3u2; I believe an upgrade would
solve your issue.
https://issues.apache.org/jira/browse/MAPREDUCE-3184

Then, regarding your queries, I'd try to help you out a bit. In MapReduce the
data transfer between map and reduce happens over HTTP. If Jetty is down
then that won't happen, which means map output in one location won't be
accessible to a reducer in another location. The map outputs are on the local
file system and not on HDFS, so even if the DataNode on the machine is up we
can't get to the data in the above circumstances.

Hope it helps!..

Regards
Bejoy.K.S


On Tue, Dec 6, 2011 at 2:15 AM, Chris Curtin curtin.ch...@gmail.com wrote:

 Hi,

 Using: *Version:* 0.20.2-cdh3u0, r81256ad0f2e4ab2bd34b04f53d25a6c23686dd14,
 8 node cluster, 64 bit Centos

 We are occasionally seeing MAX_FETCH_RETRIES_PER_MAP errors on reducer
 jobs. When we investigate it looks like the TaskTracker on the node being
 fetched from is not running. Looking at the logs we see what looks like a
 self-initiated shutdown:

 2011-12-05 14:10:48,632 INFO org.apache.hadoop.mapred.JvmManager: JVM :
 jvm_201112050908_0222_r_1100711673 exited with exit code 0. Number of tasks
 it ran: 0
 2011-12-05 14:10:48,632 ERROR org.apache.hadoop.mapred.JvmManager: Caught
 Throwable in JVMRunner. Aborting TaskTracker.
 java.lang.NullPointerException
at

 org.apache.hadoop.mapred.DefaultTaskController.logShExecStatus(DefaultTaskController.java:145)
at

 org.apache.hadoop.mapred.DefaultTaskController.launchTask(DefaultTaskController.java:129)
at

 org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:472)
at

 org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:446)
 2011-12-05 14:10:48,634 INFO org.apache.hadoop.mapred.TaskTracker:
 SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down TaskTracker at had11.atlis1/10.120.41.118
 /

 Then the reducers have the following:


 2011-12-05 14:12:00,962 WARN org.apache.hadoop.mapred.ReduceTask:
 java.net.ConnectException: Connection refused
  at java.net.PlainSocketImpl.socketConnect(Native Method)
  at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
  at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
  at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
  at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
  at java.net.Socket.connect(Socket.java:529)
  at sun.net.NetworkClient.doConnect(NetworkClient.java:158)
  at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
  at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
  at sun.net.www.http.HttpClient.init(HttpClient.java:233)
  at sun.net.www.http.HttpClient.New(HttpClient.java:306)
  at sun.net.www.http.HttpClient.New(HttpClient.java:323)
  at

 sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:970)
  at

 sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:911)
  at

 sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:836)
  at

 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1525)
  at

 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1482)
  at

 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1390)
  at

 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1301)
  at

 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1233)

 2011-12-05 14:12:00,962 INFO org.apache.hadoop.mapred.ReduceTask: Task
 attempt_201112050908_0169_r_05_0: Failed fetch #2 from
 attempt_201112050908_0169_m_02_0
 2011-12-05 14:12:00,962 INFO org.apache.hadoop.mapred.ReduceTask: Failed to
 fetch map-output from attempt_201112050908_0169_m_02_0 even after
 MAX_FETCH_RETRIES_PER_MAP retries...  or it is a read error,  reporting to
 the JobTracker
 2011-12-05 14:12:00,962 FATAL org.apache.hadoop.mapred.ReduceTask: Shuffle
 failed with too many fetch failures and insufficient progress!Killing task
 attempt_201112050908_0169_r_05_0.
 2011-12-05 14:12:00,966 WARN org.apache.hadoop.mapred.ReduceTask:
 attempt_201112050908_0169_r_05_0 adding host had11.atlis1 to penalty
 box, next contact in 8 seconds
 2011-12-05 14:12:00,966 INFO org.apache.hadoop.mapred.ReduceTask:
 attempt_201112050908_0169_r_05_0: Got 1 map-outputs from previous
 failures
 The job then fails.

 Several questions:
 1. what is causing the TaskTracker to fail/exit? This is after running
 hundreds to thousands of jobs, so it's not just at start-up.
 2. why isn't hadoop detecting that the reducers need something from a dead
 

Re: Running a job continuously

2011-12-05 Thread Bejoy Ks
Burak
       If you have a continuous inflow of data, you can use Flume to aggregate
the files into larger sequence files (or similar) if they are small, and push
them onto HDFS once you have a substantial chunk of data (roughly equal to the
HDFS block size). Based on your SLAs you then need to schedule your jobs using
Oozie or a simple shell script. In very simple terms:
- push input data (which could come from a Flume collector) into a staging hdfs dir
- before triggering the job (hadoop jar), copy the input from staging to the main
input dir
- execute the job
- archive the input and output into archive dirs (or any other dirs)
   - the output archive dir could be the source of output data
- delete the output dir and empty the input dir

Hope it helps!...

Regards
Bejoy.K.S
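
(For illustration, here is a minimal driver-loop sketch of the schedule-and-rerun
pattern described above. The directory layout, the 60-second polling interval and
the job wiring are assumptions, not part of any particular tool.)

// Minimal sketch: poll a staging dir, move new input into the job input dir,
// run an MR job, then archive the consumed input. Paths and interval are
// illustrative assumptions; mapper/reducer setup is omitted.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ContinuousDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path staging = new Path("/data/staging");
    Path input = new Path("/data/input");
    Path archive = new Path("/data/archive");

    while (true) {
      FileStatus[] ready = fs.listStatus(staging);
      if (ready != null && ready.length > 0) {
        // Move staged files into the input dir for this run.
        for (FileStatus f : ready) {
          fs.rename(f.getPath(), new Path(input, f.getPath().getName()));
        }
        Job job = new Job(conf, "wordcount-" + System.currentTimeMillis());
        job.setJarByClass(ContinuousDriver.class);
        // Mapper, reducer and output classes omitted; same as a normal job.
        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job,
            new Path("/data/output/" + System.currentTimeMillis()));
        if (job.waitForCompletion(true)) {
          // Archive the consumed input so the next run starts clean.
          for (FileStatus f : fs.listStatus(input)) {
            fs.rename(f.getPath(), new Path(archive, f.getPath().getName()));
          }
        }
      }
      Thread.sleep(60 * 1000);  // poll every minute
    }
  }
}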

On Tue, Dec 6, 2011 at 2:19 AM, burakkk burak.isi...@gmail.com wrote:

 Hi everyone,
 I want to run a MR job continuously. Because i have streaming data and i
 try to analyze it all the time in my way(algorithm). For example you want
 to solve wordcount problem. It's the simplest one :) If you have some
 multiple files and the new files are keep going, how do you handle it?
 You could execute a MR job per one file but you have to do it repeatly. So
 what do you think?

 Thanks
 Best regards...

 --

 *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
 *
 *



Re: Running a job continuously

2011-12-05 Thread Mike Spreitzer
Burak,
Before we can really answer your question, you need to give us some more 
information on the processing you want to do.  Do you want output that is 
continuous or batched (if so, how)?  How should the output at a given time 
be related to the input up to then and the previous outputs?

Regards,
Mike 


Re: Pig Output

2011-12-05 Thread Russell Jurney
hadoop dfs -cat /my/path/* > single_file

Russell Jurney
twitter.com/rjurney
russell.jur...@gmail.com
datasyndrome.com

On Dec 5, 2011, at 12:30 PM, Aaron Griffith aaron.c.griff...@gmail.com wrote:

 Using PigStorage() my pig script output gets put into partial files on the 
 hadoop
 file system.

 When I use the copyToLocal fuction from Hadoop it creates a local directory 
 with
 all the partial files.

 Is there a way to copy the partial files from hadoop into a single local file?

 Thanks



Re: Running a job continuously

2011-12-05 Thread John Conwell
You might also want to take a look at Storm, as that's what it's designed to
do: https://github.com/nathanmarz/storm/wiki

On Mon, Dec 5, 2011 at 1:34 PM, Mike Spreitzer mspre...@us.ibm.com wrote:

 Burak,
 Before we can really answer your question, you need to give us some more
 information on the processing you want to do.  Do you want output that is
 continuous or batched (if so, how)?  How should the output at a given time
 be related to the input up to then and the previous outputs?

 Regards,
 Mike




-- 

Thanks,
John C


Re: Multiple Mappers for Multiple Tables

2011-12-05 Thread Justin Vincent
Thanks Bejoy,
I was looking at DBInputFormat with MultipleInputs. MultipleInputs takes a
Path parameter. Are these paths just ignored here?
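
(For context, this is how the old-API MultipleInputs is normally used with
file-based inputs: each Path is bound to its own InputFormat and Mapper class.
The paths and the placeholder mappers below are hypothetical; how a non-file
format such as DBInputFormat treats that Path argument is exactly the open
question here.)

// Minimal sketch (old org.apache.hadoop.mapred API): MultipleInputs binds each
// input Path to its own InputFormat and Mapper. Paths and mapper logic are
// hypothetical placeholders.
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class MultipleInputsExample {

  // One mapper per input source; both emit a common (key, value) shape.
  public static class TableAMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      out.collect(new Text("A"), value);  // placeholder logic
    }
  }

  public static class TableBMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      out.collect(new Text("B"), value);  // placeholder logic
    }
  }

  public static void configure(JobConf job) {
    // Each path gets its own input format and mapper.
    MultipleInputs.addInputPath(job, new Path("/data/tableA"),
        TextInputFormat.class, TableAMapper.class);
    MultipleInputs.addInputPath(job, new Path("/data/tableB"),
        TextInputFormat.class, TableBMapper.class);
  }
}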

On Mon, Dec 5, 2011 at 2:31 PM, Bejoy Ks bejoy.had...@gmail.com wrote:

 Hi Justin,
Just to add on to my response. If you need to fetch data from
 rdbms on your mapper using your custom mapreduce code you can use the
 DBInputFormat in your mapper class with MultipleInputs. You have to be
 careful in using the number of mappers for your application as dbs would be
 constrained with a limit on maximum simultaneous connections. Also you need
 to ensure that that the same Query is not executed n number of times in n
 mappers all fetching the same data, It'd be just wastage of network. Sqoop
 + Hive would be my recommendation and a good combination for such use
 cases. If you have Pig competency you can also look into pig instead of
 hive.

 Hope it helps!...

 Regards
 Bejoy.K.S

 On Tue, Dec 6, 2011 at 1:36 AM, Bejoy Ks bejoy.had...@gmail.com wrote:

  Justin
  If I get your requirement right you need to get in data from
  multiple rdbms sources and do a join on the same, also may be some more
  custom operations on top of this. For this you don't need to go in for
  writing your custom mapreduce code unless it is that required. You can
  achieve the same in two easy steps
  - Import data from RDBMS into Hive using SQOOP (Import)
  - Use hive to do some join and processing on this data
 
  Hope it helps!..
 
  Regards
  Bejoy.K.S
 
 
  On Tue, Dec 6, 2011 at 12:13 AM, Justin Vincent justi...@gmail.com
 wrote:
 
  I would like join some db tables, possibly from different databases, in
 a
  MR job.
 
  I would essentially like to use MultipleInputs, but that seems file
  oriented. I need a different mapper for each db table.
 
  Suggestions?
 
  Thanks!
 
  Justin Vincent
 
 
 



Re: Running a job continuously

2011-12-05 Thread Abhishek Pratap Singh
Hi Burak,

Hadoop's model is quite different: it is job-based, in simpler terms a kind of
batch model, where a MapReduce job is executed on a batch of data that is
already present.
For your requirement, the word count example doesn't make sense if the file is
being written to continuously.
However, word count per hour or per minute does make sense as a MapReduce-style
program.
I second what Bejoy has mentioned: use Flume to aggregate the data and then run
MapReduce.
Hadoop can give you near real time by combining Flume with MapReduce, where you
run a MapReduce job on the Flume-dumped data every few minutes.
A second option is to see whether your problem can be solved by a Flume
decorator itself, for a real-time experience.

Regards,
Abhishek

On Mon, Dec 5, 2011 at 2:33 PM, burakkk burak.isi...@gmail.com wrote:

 Athanasios Papaoikonomou, cron job isn't useful for me. Because i want to
 execute the MR job on the same algorithm but different files have different
 velocity.

 Both Storm and facebook's hadoop are designed for that. But i want to use
 apache distribution.

 Bejoy Ks, i have a continuous inflow of data but i think i need a near
 real-time system.

 Mike Spreitzer, both output and input are continuous. Output isn't relevant
 to the input. Only that i want is all the incoming files are processed by
 the same job and the same algorithm.
 For ex, you think about wordcount problem. When you want to run wordcount,
 you implement that:
 http://wiki.apache.org/hadoop/WordCount

 But when the program find that code job.waitForCompletion(true);, somehow
 job will end up. When you want to make it continuously, what will you do in
 hadoop without other tools?
 One more thing is you assumption that the input file's name is
 filename_timestamp(filename_20111206_0030)

 public static void main(String[] args) throws Exception {
     Configuration conf = new Configuration();
     Job job = new Job(conf, "wordcount");
     job.setOutputKeyClass(Text.class);
     job.setOutputValueClass(IntWritable.class);
     job.setMapperClass(Map.class);
     job.setReducerClass(Reduce.class);
     job.setInputFormatClass(TextInputFormat.class);
     job.setOutputFormatClass(TextOutputFormat.class);
     FileInputFormat.addInputPath(job, new Path(args[0]));
     FileOutputFormat.setOutputPath(job, new Path(args[1]));
     job.waitForCompletion(true);
 }

 On Mon, Dec 5, 2011 at 11:19 PM, Bejoy Ks bejoy.had...@gmail.com wrote:

  Burak
 If you have a continuous inflow of data, you can choose flume to
  aggregate the files into larger sequence files or so if they are small
 and
  when you have a substantial chunk of data(equal to hdfs block size). You
  can push that data on to hdfs based on your SLAs you need to schedule
 your
  jobs using oozie or simpe shell script. In very simple terms
  - push input data (could be from flume collector) into a staging hdfs dir
  - before triggering the job(hadoop jar) copy the input from staging to
  main input dir
  - execute the job
  - archive the input and output into archive dirs(any other dirs).
 - the output archive dir could be source of output data
  - delete output dir and empty input dir
 
  Hope it helps!...
 
  Regards
  Bejoy.K.S
 
  On Tue, Dec 6, 2011 at 2:19 AM, burakkk burak.isi...@gmail.com wrote:
 
  Hi everyone,
  I want to run a MR job continuously. Because i have streaming data and i
  try to analyze it all the time in my way(algorithm). For example you
 want
  to solve wordcount problem. It's the simplest one :) If you have some
  multiple files and the new files are keep going, how do you handle it?
  You could execute a MR job per one file but you have to do it repeatly.
 So
  what do you think?
 
  Thanks
  Best regards...
 
  --
 
  *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
  *
  *
 
 
 


 --

 *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
 *
 *



RE: Running a job continuously

2011-12-05 Thread Ravi teja ch n v
Hi Burak,

Bejoy Ks, i have a continuous inflow of data but i think i need a near
real-time system.

Just to add to Bejoy's point,
with Oozie you can specify a data dependency for running your job:
when a specific amount of data has arrived, you can configure Oozie to run your job.
I think this will satisfy your requirement.

Regards,
Ravi Teja


From: burakkk [burak.isi...@gmail.com]
Sent: 06 December 2011 04:03:59
To: mapreduce-u...@hadoop.apache.org
Cc: common-user@hadoop.apache.org
Subject: Re: Running a job continuously

Athanasios Papaoikonomou, cron job isn't useful for me. Because i want to
execute the MR job on the same algorithm but different files have different
velocity.

Both Storm and facebook's hadoop are designed for that. But i want to use
apache distribution.

Bejoy Ks, i have a continuous inflow of data but i think i need a near
real-time system.

Mike Spreitzer, both output and input are continuous. Output isn't relevant
to the input. Only that i want is all the incoming files are processed by
the same job and the same algorithm.
For ex, you think about wordcount problem. When you want to run wordcount,
you implement that:
http://wiki.apache.org/hadoop/WordCount

But when the program find that code job.waitForCompletion(true);, somehow
job will end up. When you want to make it continuously, what will you do in
hadoop without other tools?
One more thing is you assumption that the input file's name is
filename_timestamp(filename_20111206_0030)

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}

On Mon, Dec 5, 2011 at 11:19 PM, Bejoy Ks bejoy.had...@gmail.com wrote:

 Burak
If you have a continuous inflow of data, you can choose flume to
 aggregate the files into larger sequence files or so if they are small and
 when you have a substantial chunk of data(equal to hdfs block size). You
 can push that data on to hdfs based on your SLAs you need to schedule your
 jobs using oozie or simpe shell script. In very simple terms
 - push input data (could be from flume collector) into a staging hdfs dir
 - before triggering the job(hadoop jar) copy the input from staging to
 main input dir
 - execute the job
 - archive the input and output into archive dirs(any other dirs).
- the output archive dir could be source of output data
 - delete output dir and empty input dir

 Hope it helps!...

 Regards
 Bejoy.K.S

 On Tue, Dec 6, 2011 at 2:19 AM, burakkk burak.isi...@gmail.com wrote:

 Hi everyone,
 I want to run a MR job continuously. Because i have streaming data and i
 try to analyze it all the time in my way(algorithm). For example you want
 to solve wordcount problem. It's the simplest one :) If you have some
 multiple files and the new files are keep going, how do you handle it?
 You could execute a MR job per one file but you have to do it repeatly. So
 what do you think?

 Thanks
 Best regards...

 --

 *BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
 *
 *





--

*BURAK ISIKLI** *| *http://burakisikli.wordpress.com*
*
*


Re: Availability of Job traces or logs

2011-12-05 Thread Amar Kamat
Arun,
 I want to test its behaviour under different size of jobs traces(meaning 
 number of jobs say 5,10,25,50,100) under different
 number of nodes.
 Till now i was using only the test/data given by mumak which has 19 jobs and 
 1529 node topology. I don' have many nodes
 with me to run some programs and collect logs and use Rumen to generate 
 traces.
For the varying-jobs part, you can run sleep jobs with varying numbers of 
map/reduce tasks and sleep times. For varying the cluster size, you can run 
multiple TaskTrackers on the same node; you can start with 5 trackers per node. 
Since you will be running sleep jobs, this should be OK. Make sure Hadoop 
security is turned off and the default task controller is used. Design your 
topology script intelligently so that it clubs all the trackers on the same 
node under one rack.

 I want to control the split placements so i need to modify preferred 
 locations for task attempts in the trace but the trace for
 even 19 jobs is huge. So, I was thinking whether i can get a small, medium 
 and large number of Job traces with
 corresponding topology trace so that modifying will be easier.
For this, you need to understand how Rumen handles job logs. I have created 
MAPREDUCE-3508 for adding filtering capabilities to Rumen. You can make use of 
this feature to modify Rumen output and play around with splits. You can also 
use it to select a few jobs (say 10, 50, etc.) from the input trace.

Amar

On 12/4/11 10:19 AM, ArunKumar arunk...@gmail.com wrote:

Amar,

I am attempting to write a new scheduler for Hadoop and test it using Mumak.

1 I want to test its behaviour under different size of jobs traces(meaning
number of jobs say 5,10,25,50,100) under different number of nodes.
Till now i was using only the test/data given by mumak which has 19 jobs
and 1529 node topology.
I don' have many nodes with me to run some programs and collect logs and
use Rumen to generate traces.

2 I want to control the split placements so i need to modify preferred
locations for task attempts in the trace but the trace for even 19 jobs is
huge. So, I was thinking whether i can get a small, medium and large number
of Job traces with corresponding topology trace so that modifying will be
easier.


Arun


On Sat, Dec 3, 2011 at 1:15 PM, Amar Kamat [via Lucene] 
ml-node+s472066n3556710...@n3.nabble.com wrote:

 Arun,
 You can very well run synthetic workloads like large scale sort, wordcount
 etc or more realistic workloads like PigMix (
 https://cwiki.apache.org/confluence/display/PIG/PigMix). On a decent
 enough cluster, these workloads work pretty well. Is there a specific
 reason why you want traces of varied sizes from various organizations?

  How can i make sure that the rumen generates only say 25 jobs,50 jobs or
 so
 Do you want to get 25/50 jobs based on some filtering criterion? I
 recently faced a similar situation where I wanted to extract jobs from a
 Rumen trace based on job ids. I will be happy to share these filtering
 tools.

 Amar


 On 12/1/11 8:48 AM, ArunKumar wrote:

 Hi guys !

 Apart from generating the job traces from RUMEN , can i get logs or job
 traces of varied sizes from some organizations.

 How can i make sure that the rumen generates only say 25 jobs,50 jobs or
 so
 ?


 Thanks,
 Arun

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Availability-of-Job-traces-or-logs-tp3550462p3550462.html
 Sent from the Hadoop lucene-users mailing list archive at Nabble.com.






--
View this message in context: 
http://lucene.472066.n3.nabble.com/Availability-of-Job-traces-or-logs-tp3550462p3558530.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.



Re: Automate Hadoop installation

2011-12-05 Thread alo alt
Hi,

to deploy software I suggest pulp:
https://fedorahosted.org/pulp/wiki/HowTo

For a package-based distro (Debian, Red Hat, CentOS) you can build Apache's
Hadoop, package it and deploy it. Configs, as Cos says, can be managed with
Puppet. If you use Red Hat / CentOS, take a look at Spacewalk.

best,
 Alex


On Mon, Dec 5, 2011 at 8:20 PM, Konstantin Boudnik c...@apache.org wrote:

 These that great project called BigTop (in the apache incubator) which
 provides for building of Hadoop stack.

 The part of what it provides is a set of Puppet recipes which will allow
 you
 to do exactly what you're looking for with perhaps some minor corrections.

 Serious, look at Puppet - otherwise it will be a living through nightmare
 of
 configuration mismanagements.

 Cos

 On Mon, Dec 05, 2011 at 04:02PM, praveenesh kumar wrote:
  Hi all,
 
  Can anyone guide me how to automate the hadoop installation/configuration
  process?
  I want to install hadoop on 10-20 nodes which may even exceed to 50-100
  nodes ?
  I know we can use some configuration tools like puppet/or shell-scripts ?
  Has anyone done it ?
 
  How can we do hadoop installations on so many machines parallely ? What
 are
  the best practices for this ?
 
  Thanks,
  Praveenesh




-- 
Alexander Lorenz
http://mapredit.blogspot.com
