Re: MBean for mapred context

2011-09-27 Thread zhang SunMoonStar
I have used JMX to monitor the topics and queues hosted in a message server. I am a new Hadooper, so I am very interested in this topic. I think that if the map and reduce tasks are wrapped into MBeans, we can easily monitor the tasks' status. 2011/9/27 patrick sang silvianhad...@gmail.com hi
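
For illustration, a minimal standard-MBean sketch of the idea described above: expose a task's status through JMX so a client such as jconsole can read it. TaskStatus and TaskStatusMBean are hypothetical names for this example, not Hadoop classes.

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Standard MBean: the interface name must be <ImplClass>MBean.
interface TaskStatusMBean {
    String getTaskId();
    float getProgress();
}

public class TaskStatus implements TaskStatusMBean {
    private final String taskId;
    private volatile float progress;

    public TaskStatus(String taskId) { this.taskId = taskId; }
    public String getTaskId()        { return taskId; }
    public float getProgress()       { return progress; }
    public void setProgress(float p) { progress = p; }

    public static void main(String[] args) throws Exception {
        // Register with the platform MBean server; any JMX client can then
        // browse example.hadoop:type=Task,* and read the progress attribute.
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        TaskStatus bean = new TaskStatus("attempt_201109270001_0001_m_000000_0");
        server.registerMBean(bean,
                new ObjectName("example.hadoop:type=Task,id=" + bean.getTaskId()));
        bean.setProgress(0.5f);
        Thread.sleep(60000); // keep the JVM alive so a client can attach
    }
}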

libhdfs: 32 bit jvm on 64 bit machine

2011-09-27 Thread Vivek K
Hi all, I have a 32-bit binary that uses libhdfs for accessing HDFS (on the Cloudera VM) and am trying to run it on a cluster with 64-bit machines. Unfortunately it crashes with "error while loading shared libraries: libjvm.so: wrong ELF class: ELFCLASS64" (libhdfs needs libjvm.so). I tried

Re: libhdfs: 32 bit jvm on 64 bit machine

2011-09-27 Thread Vivek K
Hi Brian, thanks for the prompt response. The machines on the cluster didn't have a libhdfs.so.0 file, so I copied my libhdfs.so (the one that came with the Cloudera VM - libhdfs0 and libhdfs0-dev) onto the cluster machine, so it should be 32-bit. The wrong ELF class error pops up when I try to use the libjvm.so on

Re: libhdfs: 32 bit jvm on 64 bit machine

2011-09-27 Thread Vivek K
Here is the output of file libhdfs.so: libhdfs.so.0: ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV), dynamically linked, stripped Vivek -- On Tue, Sep 27, 2011 at 10:24 AM, Vivek K hadoop.v...@gmail.com wrote: Hi Brian Thanks for a prompt response. The machines on cluster

Re: libhdfs: 32 bit jvm on 64 bit machine

2011-09-27 Thread Brian Bockelman
On Sep 27, 2011, at 9:24 AM, Vivek K wrote: Hi Brian Thanks for a prompt response. The machines on cluster didn't have libhdfs.so.0 file. So I copied my libhdfs.so (that came with cloudera vm - libhdfs0 and libhdfs0-dev) on the cluster machine. So it should be 32-bit. The wrong ELF

Re: libhdfs: 32 bit jvm on 64 bit machine

2011-09-27 Thread Brian Bockelman
Hi Vivek, That's a difficult question to answer due to the vagaries of Java on Linux distros (I could probably give an answer valid on SL5.7, but nothing else). You'll need to work that out with your sysadmin. I think the answer should be yes, but depending on your distribution, that yes may

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Robert Evans
The problem is step 4 in the breaking sequence. Currently the TaskTracker never looks at the disk to know whether a file is in the distributed cache or not. It assumes that if it downloaded the file and did not delete that file itself, then the file is still there in its original form. It does

Re: Getting the cpu, memory usage of map/reduce tasks

2011-09-27 Thread bikash sharma
Thanks Ralf. On Mon, Sep 26, 2011 at 2:01 PM, Ralf Heyde ralf.he...@gmx.de wrote: Hi Bikash, every map/reduce task is - as far as I know - a single JVM instance you can configure and/or run with JVM options. Maybe you can track these JVMs by using some system tools. Regards, Ralf
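
As a sketch of what Ralf suggests (configuring the per-task child JVMs so their resource usage can be observed with system tools), the 0.20-era property mapred.child.java.opts controls the options passed to each task's JVM. The flags, heap size, and log path below are only examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;

public class ChildJvmOpts {
    public static void main(String[] args) {
        JobConf conf = new JobConf(new Configuration());
        // Every map/reduce task runs in its own child JVM launched with these
        // options; @taskid@ is expanded by the framework, so each task writes
        // its own GC log that system tools or scripts can then inspect.
        conf.set("mapred.child.java.opts",
                 "-Xmx512m -verbose:gc -Xloggc:/tmp/@taskid@.gc");
        // ... set input/output paths, mapper/reducer classes, and submit as usual ...
    }
}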

configuring different number of slaves for MR jobs

2011-09-27 Thread bikash sharma
Hi -- can we specify a different set of slaves for each MapReduce job run? I tried using the --config option and specifying a different set of slaves in the slaves config file. However, it does not use the selected slaves but the set initially configured. Any help? Thanks, Bikash

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Meng Mao
Who is in charge of getting the files there for the first time? The addCacheFile call in the mapreduce job? Or a manual setup by the user/operator? On Tue, Sep 27, 2011 at 11:35 AM, Robert Evans ev...@yahoo-inc.com wrote: The problem is the step 4 in the breaking sequence. Currently the

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Robert Evans
addCacheFile sets a config value in your jobConf that indicates which files your particular job depends on. When the TaskTracker is assigned to run part of your job (map task or reduce task), it will download your jobConf, read it in, and then download the files listed in the conf, if it has
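
A minimal sketch of the 0.20-era API Robert is describing: the driver records the HDFS file in the jobConf, and each task reads the localized copy rather than going back to HDFS. The paths and class name are placeholders.

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class CacheDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CacheDriver.class);
        // Records the URI in the jobConf; the TaskTracker downloads the file
        // to its local disk before running the job's tasks.
        DistributedCache.addCacheFile(new URI("/user/example/lookup.dat"), conf);
        // Inside a Mapper's configure(JobConf job) you would then read the
        // localized copy:
        //   Path[] local = DistributedCache.getLocalCacheFiles(job);
        //   ... open local[0] from local disk, not through the HDFS client ...
    }
}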

Re: configuring different number of slaves for MR jobs

2011-09-27 Thread Vitthal Suhas Gogate
The slaves file is used only by control scripts like {start/stop}-dfs.sh and {start/stop}-mapred.sh to start the datanodes and task trackers on a specified set of slave machines. It cannot be used effectively to change the size of the cluster for each M/R job (unless you want to restart the task

Re: configuring different number of slaves for MR jobs

2011-09-27 Thread bikash sharma
Thanks Suhas. I will try using HOD. The use case for me is some research experiments with a different set of slaves for each job run. On Tue, Sep 27, 2011 at 1:03 PM, Vitthal Suhas Gogate gog...@hortonworks.com wrote: Slaves file is used only by control scripts like {start/stop}-dfs.sh,

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Robert Evans
If you are never ever going to use that file again for any map/reduce task in the future then yes, you can delete it, but I would not recommend it. If you want to reduce the amount of space that is used by the distributed cache, there is a config parameter for that: local.cache.size. It is the
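
For reference, the property Robert mentions, local.cache.size, is measured in bytes and is normally set cluster-wide in the TaskTracker's mapred-site.xml rather than per job; the snippet below only illustrates the name and units, and the 1 GB figure is just an example.

import org.apache.hadoop.conf.Configuration;

public class CacheSizeExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Upper bound on the space the TaskTracker keeps for the
        // distributed cache before it starts cleaning out old entries.
        conf.setLong("local.cache.size", 1024L * 1024L * 1024L); // ~1 GB
        System.out.println("local.cache.size = " + conf.get("local.cache.size"));
    }
}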

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Meng Mao
I'm not concerned about disk space usage -- the script we used that deleted the taskTracker cache path has been fixed not to do so. I'm curious about the exact behavior of jobs that use DistributedCache files. Again, it seems safe from your description to delete files between completed runs. How

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Robert Evans
Yes, all of the state for the task tracker is in memory. It never looks at the disk to see what is there; it only maintains the state in memory. --bobby Evans On 9/27/11 1:00 PM, Meng Mao meng...@gmail.com wrote: I'm not concerned about disk space usage -- the script we used that deleted the

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Meng Mao
So the proper description of how DistributedCache normally works is: 1. Have files to be cached sitting around in HDFS. 2. Run Job A, which specifies those files to be put into DistributedCache space. Each worker node copies the to-be-cached files from HDFS to local disk, but more importantly, the

Re: Environment consideration for a research on scheduling

2011-09-27 Thread Merto Mertek
The Desktop edition was chosen just to run the namenode and to monitor cluster statistics. Worker nodes were chosen to run on Ubuntu Server edition because we found this configuration in several research papers. One such configuration can be found in the paper for the LATE scheduler (is maybe some source

Re: operation of DistributedCache following manual deletion of cached files?

2011-09-27 Thread Robert Evans
That is correct. However, it is a bit more complicated than that. The TaskTracker's in-memory index of the distributed cache is keyed off the path of the file and the HDFS creation time of the file. So if you delete the original file off of HDFS and then recreate it with a new time
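
An illustrative sketch of the keying Robert describes (this is not the actual TaskTracker code; the path is a placeholder, and the file's modification time stands in for the creation timestamp since that is what FileStatus exposes):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CacheKeySketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/user/example/lookup.dat"); // placeholder HDFS path
        FileStatus st = fs.getFileStatus(p);
        // The cache entry is effectively keyed on (path, timestamp); deleting
        // the file and recreating it under the same path yields a new timestamp,
        // hence a new key, so the file is localized again.
        String key = p.toString() + "#" + st.getModificationTime();
        System.out.println(key);
    }
}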

RE: Temporary Files to be sent to DistributedCache

2011-09-27 Thread GOEKE, MATTHEW (AG/1000)
The simplest route I can think of is to ingest the data directly into HDFS using Sqoop if there is a driver currently made for your database. At that point it would be relatively simple just to read directly from HDFS in your MR code. Matt -Original Message- From: lessonz

Scheduler Algorithm in Hadoop

2011-09-27 Thread Amal De Silva
Hi, I am a scheduling/optimization algorithm expert with many years of experience (in applying scheduling to the manufacturing and transportation industries). I want to see if I can improve the scheduling algorithms in Hadoop, like the Fair Scheduler. However, I am struggling to get a basic Hadoop

Re: Scheduler Algorithm in Hadoop

2011-09-27 Thread Arun C Murthy
Amal, Welcome to Hadoop! Currently we have two, very different, versions of MapReduce: MRv1 and MRv2. Most of the active developers on MapReduce are working on MRv2. If you want to take a look please see the sources under trunk/branch-0.23:

Re: difference between development and production platform???

2011-09-27 Thread Arko Provo Mukherjee
Hi, a development platform is the system(s) used mainly by the developers to write / unit test code for the project. There are generally NO end users on the development system. The production platform is where the end users actually work, and the project is generally moved here only

Re: difference between development and production platform???

2011-09-27 Thread Hamedani, Masoud
Special thanks for your help Arko. You mean that in Hadoop, the NameNode, DataNodes, JobTracker, TaskTrackers and all the cluster nodes should be deployed on Linux machines??? We have lots of data (on Windows OS) and code (written in C#) for data mining; we want to use Hadoop and make a connection between our

Re: DataBlockScanner

2011-09-27 Thread Linden Hillenbrand
Bourne, Is there only one datanode? The "Verification succeeded" messages are from a DataNode background housekeeping task, DataBlockScanner, which attempts to discover any replicas that have become corrupt. If it finds one (which should be rare), it tells the NameNode the replica has become

Re: difference between development and production platform???

2011-09-27 Thread Linden Hillenbrand
Currently Windows is not a supported production platform for Hadoop. You should run all of your daemons on Linux machines. You can move your data to HDFS on those nodes easily; for the C# piece you can use Hadoop Streaming ( http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Hadoop+Streaming)

Re: Temporary Files to be sent to DistributedCache

2011-09-27 Thread lessonz
So, I thought about that, and I'd considered writing to the HDFS and then copying the file into the DistributedCache so each mapper/reducer doesn't have to reach into the HDFS for these files. Is that the best way to handle this? On Tue, Sep 27, 2011 at 4:01 PM, GOEKE, MATTHEW (AG/1000)

Subscribe to List

2011-09-27 Thread Yu Yang
hello, Subscribe to List common-user-subscr...@hadoop.apache.org thx

Re: Temporary Files to be sent to DistributedCache

2011-09-27 Thread Linden Hillenbrand
Most likely that is the easiest and fastest way, as you will be leveraging the distributed ingestion of Sqoop rather than a single-threaded import some other way. On Wed, Sep 28, 2011 at 12:27 AM, lessonz less...@q.com wrote: So, I thought about that, and I'd considered writing to the HDFS and then

Re: difference between development and production platform???

2011-09-27 Thread Arko Provo Mukherjee
Hi, you don't necessarily need to execute the C# code on Linux. You can write a middleware application to bring the data from the Windows boxes to the Linux (Hadoop) boxes if you want to. Cheers Arko On Tue, Sep 27, 2011 at 10:19 PM, Hamedani, Masoud mas...@agape.hanyang.ac.kr wrote: Special

Re: difference between development and production platform???

2011-09-27 Thread Linden Hillenbrand
Hadoop Streaming :) On Wed, Sep 28, 2011 at 12:30 AM, Arko Provo Mukherjee arkoprovomukher...@gmail.com wrote: Hi, You necessarily don't need to execute the C# codes on Linux. You can write a middleware application to bring the data from the Win boxes to the Linux (Hadoop) boxes if you

Re: difference between development and production platform???

2011-09-27 Thread Hamedani, Masoud
Thanks for your nice help Arko. Maybe because I'm new to Hadoop I can't get some of the points; I'm studying the Hadoop manual more deeply to get better info. B.S Masoud. 2011/9/28 Arko Provo Mukherjee arkoprovomukher...@gmail.com Hi, You necessarily don't need to execute the C# codes on Linux. You

Reduce shuffle bytes in GUI

2011-09-27 Thread john smith
Hey folks, I have my JobTracker GUI which shows a lot of information about the running/completed jobs. I am interested in the field "Reduce shuffle bytes" and want to know how it is computed... Is it just the sum of all the bytes received per reducer during the shuffle? Any help? Thanks