RE: Right way to implement MR ?
Samir,

Depends upon your data format and what you want to achieve. More information could have helped.

-Shailesh

-----Original Message-----
From: samir das mohapatra [mailto:samir.help...@gmail.com]
Sent: Thursday, May 24, 2012 1:17 AM
To: common-user@hadoop.apache.org
Subject: Right way to implement MR ?

Hi All,

How do I compare two input files in an M/R job? Say log file A is around 30 GB and log file B is around 60 GB. I wanted to know how to define this inside the mapper.

Thanks,
samir.
Re: Right way to implement MR ?
Thanks Harsh J for your help.

On Thu, May 24, 2012 at 1:24 AM, Harsh J wrote:
> Samir,
>
> You can use MultipleInputs for multiple forms of inputs per mapper
> (with their own input K/V types, but common output K/V types) with a
> common reduce-side join/compare.
>
> See
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/lib/input/MultipleInputs.html
>
> On Thu, May 24, 2012 at 1:17 AM, samir das mohapatra wrote:
> > Hi All,
> > How do I compare two input files in an M/R job?
> > Say log file A is around 30 GB and log file B is around 60 GB.
> > I wanted to know how to define this inside the mapper.
> >
> > Thanks
> > samir.
>
> --
> Harsh J
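For readers finding this thread later, the reduce-side compare Harsh suggests boils down to tagging each record with its source file so that records with the same key meet in one reduce call, where the two sides can be compared. Here is a minimal plain-Java sketch of that logic — no Hadoop dependencies, and the class and record layout are illustrative, not part of the MultipleInputs API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ReduceSideCompare {

    // Simulates the shuffle: group "tag + value" strings by key, the way a
    // reducer would receive all values for one key in a single call.
    static Map<String, List<String>> shuffle(List<String[]> taggedRecords) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String[] rec : taggedRecords) { // rec = {key, tag, value}
            grouped.computeIfAbsent(rec[0], k -> new ArrayList<>())
                   .add(rec[1] + "\t" + rec[2]);
        }
        return grouped;
    }

    // Reducer-side logic: a key "matches" when both logs contributed a record.
    static boolean presentInBoth(List<String> values) {
        boolean inA = false, inB = false;
        for (String v : values) {
            if (v.startsWith("A\t")) inA = true;
            if (v.startsWith("B\t")) inB = true;
        }
        return inA && inB;
    }

    public static void main(String[] args) {
        // Records as the two mappers would emit them: (key, sourceTag, value).
        List<String[]> records = List.of(
            new String[] {"user1", "A", "login"},
            new String[] {"user1", "B", "purchase"},
            new String[] {"user2", "A", "login"});
        for (Map.Entry<String, List<String>> e : shuffle(records).entrySet()) {
            System.out.println(e.getKey() + " in both: " + presentInBoth(e.getValue()));
        }
    }
}
```

In an actual job, the mapper for log A would emit values tagged "A" and the one for log B values tagged "B", each registered for its own input path via MultipleInputs.addInputPath, with the compare done in the common reducer.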
Re: 3 machine cluster trouble
OK, so all nodes are configured the same except for master/slave differences, and they are all running HDFS. All daemons seem to be running when I do a start-all.sh from the master. However, the master Map/Reduce Administration page shows only two live nodes, while the HDFS page shows 3. Looking at the log files on the new slave node I see no outright errors, but I do see this in the tasktracker log file. All machines have 8 GB of memory. I think the important part below is "TaskTracker's totalMemoryAllottedForTasks is -1". I've searched for others with this problem but haven't found anything matching my case, which is just trying to start up. No tasks have been run.

2012-05-24 11:20:46,786 INFO org.apache.hadoop.mapred.TaskTracker: Starting tracker tracker_occam3:localhost/127.0.0.1:45700
2012-05-24 11:20:46,792 INFO org.apache.hadoop.mapred.TaskTracker: Starting thread: Map-events fetcher for all reduce tasks on tracker_occam3:localhost/127.0.0.1:45700
2012-05-24 11:20:46,792 INFO org.apache.hadoop.mapred.TaskTracker: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@5abd09e8
2012-05-24 11:20:46,795 WARN org.apache.hadoop.mapred.TaskTracker: TaskTracker's totalMemoryAllottedForTasks is -1. TaskMemoryManager is disabled.
2012-05-24 11:20:46,795 INFO org.apache.hadoop.mapred.IndexCache: IndexCache created with max memory = 10485760
2012-05-24 11:20:46,800 INFO org.apache.hadoop.mapred.TaskTracker: Shutting down: Map-events fetcher for all reduce tasks on tracker_occam3:localhost/127.0.0.1:45700
2012-05-24 11:20:46,800 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Cleanup...
java.lang.InterruptedException: sleep interrupted
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.filecache.TrackerDistributedCacheManager$CleanupThread.run(TrackerDistributedCacheManager.java:926)
2012-05-24 11:20:46,900 INFO org.apache.hadoop.ipc.Server: Stopping server on 45700
2012-05-24 11:20:46,901 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 45700: exiting
2012-05-24 11:20:46,901 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 45700: exiting
2012-05-24 11:20:46,902 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 45700: exiting
2012-05-24 11:20:46,902 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 45700
2012-05-24 11:20:46,901 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 45700: exiting
2012-05-24 11:20:46,904 INFO org.apache.hadoop.ipc.metrics.RpcInstrumentation: shut down
2012-05-24 11:20:46,904 INFO org.apache.hadoop.mapred.TaskTracker: Shutting down StatusHttpServer
2012-05-24 11:20:46,904 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 45700: exiting
2012-05-24 11:20:46,903 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 45700: exiting
2012-05-24 11:20:46,903 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 45700: exiting
2012-05-24 11:20:46,904 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 45700: exiting
2012-05-24 11:20:46,904 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2012-05-24 11:20:46,909 INFO org.mortbay.log: Stopped SelectChannelConnector@0.0.0.0:50060

On 5/23/12 3:55 PM, James Warren wrote:
> Hi Pat -
>
> The setting for hadoop.tmp.dir is used both locally and on HDFS and
> therefore should be consistent across your cluster.
> http://stackoverflow.com/questions/2354525/what-should-be-hadoop-tmp-dir
>
> cheers,
> -James
>
> On Wed, May 23, 2012 at 3:44 PM, Pat Ferrel wrote:
> > I have a two machine cluster and am adding a new machine.
> > The new node has a different location for hadoop.tmp.dir than the other
> > two nodes and refuses to start the datanode when started in the cluster.
> > When I change the location pointed to by hadoop.tmp.dir to be the same on
> > all machines, it starts up fine everywhere. Shouldn't I be able to have
> > the master and slave1 set as:
> >
> > hadoop.tmp.dir
> > /app/hadoop/tmp
> > A base for other temporary directories.
> >
> > and slave2 set as:
> >
> > hadoop.tmp.dir
> > /media/d2/app/hadoop/tmp
> > A base for other temporary directories.
> >
> > Slave2 runs standalone in single-node mode just fine. Using 0.20.205.
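Following James's advice, the safe setup is one value on every node. A sketch of what that looks like in each machine's core-site.xml, using the path from this thread:

```xml
<!-- core-site.xml, identical on the master and all slaves -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>
```

If slave2 really must keep its data on the /media/d2 disk, one workaround is to leave the configured value uniform and make /app/hadoop/tmp on that node a symlink to /media/d2/app/hadoop/tmp.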
Re: While Running in cloudera version of hadoop getting error
Why don't you use the same Hadoop version in both clusters? It would save you some trouble.

On 05/24/2012 02:26 PM, samir das mohapatra wrote:
> Hi,
> I created an application jar and was trying to run it on a 2-node cluster
> using Cloudera .20, and it was running fine. But when I run that same jar on
> the deployment server (Cloudera version .20.x) with a 40-node cluster, I get
> an error. Could anyone please help me with this?
>
> 12/05/24 09:39:09 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

Like it says here, you should implement Tool for your MapReduce job.

> 12/05/24 09:39:10 INFO mapred.FileInputFormat: Total input paths to process : 1
> 12/05/24 09:39:10 INFO mapred.JobClient: Running job: job_201203231049_12426
> 12/05/24 09:39:11 INFO mapred.JobClient: map 0% reduce 0%
> 12/05/24 09:39:20 INFO mapred.JobClient: Task Id : attempt_201203231049_12426_m_00_0, Status : FAILED
> java.lang.RuntimeException: Error in configuring object
>     at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
>     at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
>     at org.apache.hadoop.mapred.Child.main(Child.java:264)
> Caused by: java.lang.reflect.InvocationTargetException
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav
> attempt_201203231049_12426_m_00_0: getDefaultExtension()
> 12/05/24 09:39:20 INFO mapred.JobClient: Task Id : attempt_201203231049_12426_m_01_0, Status : FAILED
>
> Thanks
> samir

--
Marcos Luis Ortíz Valmaseda
Data Engineer && Sr. System Administrator at UCI
http://marcosluis2186.posterous.com
http://www.linkedin.com/in/marcosluis2186
Twitter: @marcosluis2186
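The first warning in that log points at the fix Marcos mentions: implement org.apache.hadoop.util.Tool and launch through ToolRunner, so that GenericOptionsParser strips generic options (-D, -files, -libjars) before your code sees the arguments. Below is a structural sketch of that pattern with stand-in types — these are NOT the real Hadoop classes; a real job would implement the Hadoop interfaces and configure/submit a JobConf inside run():

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for org.apache.hadoop.util.Tool (illustrative only).
interface Tool {
    int run(String[] args);
}

// Stand-in for org.apache.hadoop.util.ToolRunner: mimics how
// GenericOptionsParser consumes generic options before handing the
// remaining application args to the tool.
class ToolRunner {
    static int run(Tool tool, String[] args) {
        List<String> remaining = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            if ("-D".equals(args[i]) && i + 1 < args.length) {
                i++; // the real parser would set this property on the JobConf
            } else {
                remaining.add(args[i]);
            }
        }
        return tool.run(remaining.toArray(new String[0]));
    }
}

public class MyJob implements Tool {
    @Override
    public int run(String[] args) {
        // A real implementation would configure and submit the M/R job here.
        System.out.println("application args: " + args.length);
        return args.length == 2 ? 0 : 1; // expect <input> <output>
    }

    public static void main(String[] args) {
        System.exit(ToolRunner.run(new MyJob(), args));
    }
}
```

With the real classes, the driver's main is just ToolRunner.run(new Configuration(), new MyJob(), args), and options like -D mapred.reduce.tasks=4 never reach your run() method.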
Re: 3 machine cluster trouble
Oops, after a few trials I got an ERROR for incompatible build versions. Copied code from the master, reformatted, et voila.

On 5/24/12 11:34 AM, Pat Ferrel wrote:
> ok, so all nodes are configured the same except for master/slave
> differences. [...]
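Since the root cause here turned out to be incompatible build versions, a quick sanity check is to compare the version string each node reports. A sketch — the host list and ssh loop are illustrative and commented out; the comparison below runs on hard-coded strings:

```shell
# On a live cluster you would collect the real strings, e.g.:
#   for h in master slave1 slave2; do ssh "$h" 'hadoop version | head -1'; done
# Hard-coded sample output, one line per node:
versions="Hadoop 0.20.205.0
Hadoop 0.20.205.0
Hadoop 0.20.203.0"

# More than one unique version string means the builds disagree.
unique=$(printf '%s\n' "$versions" | sort -u | wc -l)
if [ "$unique" -gt 1 ]; then
  echo "MISMATCH: nodes report different Hadoop builds"
else
  echo "OK: all nodes report the same Hadoop build"
fi
```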