RE: Right way to implement MR ?

2012-05-24 Thread Shailesh Dargude
Samir,
  
It depends on your data format and what you want to achieve. More
information would have helped.

-Shailesh.

-Original Message-
From: samir das mohapatra [mailto:samir.help...@gmail.com] 
Sent: Thursday, May 24, 2012 1:17 AM
To: common-user@hadoop.apache.org
Subject: Right way to implement MR ?

Hi All,
 How do I compare two input files in an M/R job?
 Say log file A is around 30 GB
 and log file B is around 60 GB.

  I wanted to know how I should define this inside the mapper.

 Thanks
  samir.


Re: Right way to implement MR ?

2012-05-24 Thread samir das mohapatra
Thanks, Harsh J, for your help.

On Thu, May 24, 2012 at 1:24 AM, Harsh J  wrote:

> Samir,
>
> You can use MultipleInputs for multiple forms of inputs per mapper
> (with their own input K/V types, but common output K/V types) with a
> common reduce-side join/compare.
>
> See
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/lib/input/MultipleInputs.html
>
> On Thu, May 24, 2012 at 1:17 AM, samir das mohapatra
>  wrote:
> > Hi All,
> > How do I compare two input files in an M/R job?
> > Say log file A is around 30 GB
> > and log file B is around 60 GB.
> >
> >  I wanted to know how I should define this inside the mapper.
> >
> >  Thanks
> >  samir.
>
>
>
> --
> Harsh J
>
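
A minimal sketch of the reduce-side join Harsh describes, using the new-API
MultipleInputs (the class names are hypothetical, and it assumes the join key
is the first tab-separated field of each log line):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompareLogs {

  // Mapper for log A: tag each record so the reducer knows its origin.
  public static class LogAMapper extends Mapper<LongWritable, Text, Text, Text> {
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String joinKey = line.toString().split("\t", 2)[0];
      ctx.write(new Text(joinKey), new Text("A\t" + line));
    }
  }

  // Mapper for log B: its own input handling, but the same output K/V types.
  public static class LogBMapper extends Mapper<LongWritable, Text, Text, Text> {
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String joinKey = line.toString().split("\t", 2)[0];
      ctx.write(new Text(joinKey), new Text("B\t" + line));
    }
  }

  // The reducer sees records from both files grouped by key and can compare them.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      for (Text v : values) {
        ctx.write(key, v);  // real compare/join logic goes here
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "compare-logs");
    job.setJarByClass(CompareLogs.class);
    MultipleInputs.addInputPath(job, new Path(args[0]),
        TextInputFormat.class, LogAMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]),
        TextInputFormat.class, LogBMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}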


Re: 3 machine cluster trouble

2012-05-24 Thread Pat Ferrel
OK, so all nodes are configured the same except for master/slave
differences. They are all running HDFS, and all daemons seem to be running
when I do a start-all.sh from the master. However, the master's Map/Reduce
Administration page shows only two live nodes; the HDFS page shows three.


Looking at the log files on the new slave node I see no outright errors,
but I do see this in the TaskTracker log. All machines have 8 GB of memory.
I think the important part below is "TaskTracker's
totalMemoryAllottedForTasks is -1". I've searched for others with this
problem but haven't found anything matching my case, which is just trying
to start up; no tasks have been run.


2012-05-24 11:20:46,786 INFO org.apache.hadoop.mapred.TaskTracker: 
Starting tracker tracker_occam3:localhost/127.0.0.1:45700
2012-05-24 11:20:46,792 INFO org.apache.hadoop.mapred.TaskTracker: 
Starting thread: Map-events fetcher for all reduce tasks on 
tracker_occam3:localhost/127.0.0.1:45700
2012-05-24 11:20:46,792 INFO org.apache.hadoop.mapred.TaskTracker:  
Using ResourceCalculatorPlugin : 
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@5abd09e8
2012-05-24 11:20:46,795 WARN org.apache.hadoop.mapred.TaskTracker: 
TaskTracker's totalMemoryAllottedForTasks is -1. TaskMemoryManager is 
disabled.
2012-05-24 11:20:46,795 INFO org.apache.hadoop.mapred.IndexCache: 
IndexCache created with max memory = 10485760
2012-05-24 11:20:46,800 INFO org.apache.hadoop.mapred.TaskTracker: 
Shutting down: Map-events fetcher for all reduce tasks on 
tracker_occam3:localhost/127.0.0.1:45700
2012-05-24 11:20:46,800 INFO 
org.apache.hadoop.filecache.TrackerDistributedCacheManager: Cleanup...

java.lang.InterruptedException: sleep interrupted
at java.lang.Thread.sleep(Native Method)
at 
org.apache.hadoop.filecache.TrackerDistributedCacheManager$CleanupThread.run(TrackerDistributedCacheManager.java:926)
2012-05-24 11:20:46,900 INFO org.apache.hadoop.ipc.Server: Stopping 
server on 45700
2012-05-24 11:20:46,901 INFO org.apache.hadoop.ipc.Server: IPC Server 
handler 3 on 45700: exiting
2012-05-24 11:20:46,901 INFO org.apache.hadoop.ipc.Server: IPC Server 
handler 1 on 45700: exiting
2012-05-24 11:20:46,902 INFO org.apache.hadoop.ipc.Server: IPC Server 
handler 2 on 45700: exiting
2012-05-24 11:20:46,902 INFO org.apache.hadoop.ipc.Server: Stopping IPC 
Server listener on 45700
2012-05-24 11:20:46,901 INFO org.apache.hadoop.ipc.Server: IPC Server 
handler 0 on 45700: exiting
2012-05-24 11:20:46,904 INFO 
org.apache.hadoop.ipc.metrics.RpcInstrumentation: shut down
2012-05-24 11:20:46,904 INFO org.apache.hadoop.mapred.TaskTracker: 
Shutting down StatusHttpServer
2012-05-24 11:20:46,904 INFO org.apache.hadoop.ipc.Server: IPC Server 
handler 7 on 45700: exiting
2012-05-24 11:20:46,903 INFO org.apache.hadoop.ipc.Server: IPC Server 
handler 6 on 45700: exiting
2012-05-24 11:20:46,903 INFO org.apache.hadoop.ipc.Server: IPC Server 
handler 4 on 45700: exiting
2012-05-24 11:20:46,904 INFO org.apache.hadoop.ipc.Server: IPC Server 
handler 5 on 45700: exiting
2012-05-24 11:20:46,904 INFO org.apache.hadoop.ipc.Server: Stopping IPC 
Server Responder
2012-05-24 11:20:46,909 INFO org.mortbay.log: Stopped 
SelectChannelConnector@0.0.0.0:50060




On 5/23/12 3:55 PM, James Warren wrote:

Hi Pat -

The setting for hadoop.tmp.dir is used both locally and on HDFS and
therefore should be consistent across your cluster.

http://stackoverflow.com/questions/2354525/what-should-be-hadoop-tmp-dir

cheers,
-James

On Wed, May 23, 2012 at 3:44 PM, Pat Ferrel  wrote:


I have a two-machine cluster and am adding a new machine. The new node has
a different location for hadoop.tmp.dir than the other two nodes, and it
refuses to start the datanode when started in the cluster. When I change
the location pointed to by hadoop.tmp.dir to be the same on all machines,
it starts up fine everywhere.

Shouldn't I be able to have the master and slave1 set as:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

And slave2 set as:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/media/d2/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

??? Slave2 runs standalone in single-node mode just fine. Using 0.20.205.
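
One approach consistent with James's advice (a sketch; the paths and the use
of dfs.data.dir for node-specific storage are illustrative): keep
hadoop.tmp.dir identical on every node, and point the DataNode at slave2's
larger disk via dfs.data.dir in hdfs-site.xml instead.

<!-- core-site.xml, identical on every node -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<!-- hdfs-site.xml on slave2 only -->
<property>
  <name>dfs.data.dir</name>
  <value>/media/d2/app/hadoop/data</value>
  <description>Node-specific location for DataNode block storage.</description>
</property>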



Re: While Running in cloudera version of hadoop getting error

2012-05-24 Thread Marcos Ortiz

Why don't you use the same Hadoop version in both clusters?
It would spare you minor troubles.


On 05/24/2012 02:26 PM, samir das mohapatra wrote:

Hi,
   I created an application jar and ran it on a 2-node cluster using the
Cloudera 0.20 version; it ran fine.
But when I run that same jar on the deployment server (Cloudera 0.20.x)
with a 40-node cluster, I get an error.

Could anyone please help me with this?

12/05/24 09:39:09 WARN mapred.JobClient: Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the same.

As the warning says here, you should implement Tool in your MapReduce job.
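
A minimal sketch of the Tool pattern that warning refers to, using the old
mapred API visible in the stack trace (MyJob and the job setup are
placeholders):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {
  // ToolRunner applies GenericOptionsParser, so -D, -files and -libjars
  // options are consumed before run() sees the remaining arguments.
  @Override
  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), MyJob.class);
    conf.setJobName("my-job");
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    // set mapper/reducer/format classes here
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new MyJob(), args));
  }
}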


12/05/24 09:39:10 INFO mapred.FileInputFormat: Total input paths to process
: 1

12/05/24 09:39:10 INFO mapred.JobClient: Running job: job_201203231049_12426

12/05/24 09:39:11 INFO mapred.JobClient:  map 0% reduce 0%

12/05/24 09:39:20 INFO mapred.JobClient: Task Id :
attempt_201203231049_12426_m_00_0, Status : FAILED

java.lang.RuntimeException: Error in configuring object

 at
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)

 at
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)

 at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)

 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387)

 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)

 at org.apache.hadoop.mapred.Child$4.run(Child.java:270)

 at java.security.AccessController.doPrivileged(Native Method)

 at javax.security.auth.Subject.doAs(Subject.java:396)

 at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)

 at org.apache.hadoop.mapred.Child.main(Child.java:264)

Caused by: java.lang.reflect.InvocationTargetException

 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)

 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav

attempt_201203231049_12426_m_00_0: getDefaultExtension()

12/05/24 09:39:20 INFO mapred.JobClient: Task Id :
attempt_201203231049_12426_m_01_0, Status : FAILED



Thanks

samir





--
Marcos Luis Ortíz Valmaseda
 Data Engineer && Sr. System Administrator at UCI
 http://marcosluis2186.posterous.com
 http://www.linkedin.com/in/marcosluis2186
 Twitter: @marcosluis2186




Re: 3 machine cluster trouble

2012-05-24 Thread Pat Ferrel
Oops, after a few trials I got an ERROR about incompatible build versions.
I copied the code from the master, reformatted, et voilà.

