Re: How to set up Hive on a single node?
Thanks for your reply! I've already installed Hive correctly. First, I installed CDH3 (https://ccp.cloudera.com/display/CDHDOC/CDH3+Installation). Unfortunately I use Ubuntu Oneiric and CDH doesn't support Oneiric, so I downloaded and installed the CDH3 packages for Lucid. Then I installed Hadoop with the following command:

sudo apt-get install hadoop-0.20

Next, I installed Hive with:

sudo apt-get install hadoop-hive

After that, I installed MySQL (http://ariejan.net/2007/12/12/how-to-install-mysql-on-ubuntudebian). Finally, I configured Hive (https://ccp.cloudera.com/display/CDHDOC/Hive+Installation) and some variables (https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration) in conf/hive-site.xml:
+ Hive - Hadoop: hadoop.bin.path, hadoop.config.dir
+ Hive - MySQL: hive.metastore.warehouse.dir

Now it runs correctly. Thanks so much!

2012/2/9 hadoop hive hadooph...@gmail.com wrote: Hey Lac, it looks like you don't have the DBS table in your metastore (Derby or MySQL). You have to install Hive again, or build it again through Ant. Check your metastore (whether DBS exists or not). Thanks & regards, Vikas Srivastava

On Fri, Feb 10, 2012 at 8:33 AM, Lac Trung trungnb3...@gmail.com wrote: Thanks for your reply! I think I installed Hadoop correctly, because when I run the wordcount example I get the correct output. But I didn't know how to install Hive, so I installed it following https://cwiki.apache.org/confluence/display/Hive/GettingStarted, which includes installing Hadoop 0.20 (maybe not on a single node) ^_^ I configured hive-site.xml as in the instructions, but got an error like this:

hive> show tables;
FAILED: Error in metadata: javax.jdo.JDODataStoreException: Required table missing : `DBS` in Catalog Schema . DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable datanucleus.autoCreateTables
NestedThrowables: org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table missing : `DBS` in Catalog Schema . DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable datanucleus.autoCreateTables
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

I didn't know what to do, so I reinstalled Ubuntu to start from scratch, and I hope that someone can show me the way. -- Lạc Trung 20083535
Combining MultithreadedMapper threadpool size map.tasks.maximum
I'm looking to clarify the relationship between MultithreadedMapper.setNumberOfThreads(i) and mapreduce.tasktracker.map.tasks.maximum. If I set:
- MultithreadedMapper.setNumberOfThreads( 4 )
- mapreduce.tasktracker.map.tasks.maximum = 1
will 4 map tasks be executed in four separate threads within one JVM? Or is the number of threads also restricted by the map.tasks.maximum parameter? What about if I set:
- MultithreadedMapper.setNumberOfThreads( 4 )
- mapreduce.tasktracker.map.tasks.maximum = 4
Will this mean that 4 map tasks are executed in 4 threads in one JVM, or will it mean that 4 JVMs are instantiated, each executing 4 map tasks in individual threads? thanks,
Re: Combining MultithreadedMapper threadpool size map.tasks.maximum
Hi Rob, On Fri, Feb 10, 2012 at 5:55 PM, Rob Stewart robstewar...@gmail.com wrote: I'm looking to clarify the relationship between MultithreadedMapper.setNumberOfThreads(i) and mapreduce.tasktracker.map.tasks.maximum . The former is an in-user-application value that controls the total number of threads to run for map() calls (inside a mapper). This is _inside_ one JVM (a task, in hadoop terms, is one complete JVM running user code). The latter controls, at a TaskTracker level, the max total number of map-task JVMs that it can run concurrently at any given time. What about if I set: - MultithreadedMapper.setNumberOfThreads( 4 ) - mapreduce.tasktracker.map.tasks.maximum = 4 Will this mean that 4 map tasks are executed in 4 threads in one JVM, or will it mean that 4 JVMs be instantiated, each executing 4 map tasks in individual threads? 4 JVMs if you have 4 tasks in your Job (# of map tasks of a job is dependent on its input). Each JVM will then run the MultithreadedMapper code, which will then run 4 threads to call your map() inside of it cause you've asked that of it. -- Harsh J Customer Ops. Engineer Cloudera | http://tiny.cloudera.com/about
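To make the two knobs concrete, here is a minimal driver sketch against the new (org.apache.hadoop.mapreduce) API. The class names MtJobDriver and MyLineMapper and the input/output paths are made up for illustration: the thread count is a per-job setting made through MultithreadedMapper, while the slot maximum is a TaskTracker daemon setting that belongs in mapred-site.xml on each node rather than in job code.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MtJobDriver {

  // A trivial stand-in mapper; any ordinary Mapper implementation works here.
  public static class MyLineMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(new Text(value), new IntWritable(1));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "multithreaded-mapper-example");
    job.setJarByClass(MtJobDriver.class);

    // The task itself runs MultithreadedMapper...
    job.setMapperClass(MultithreadedMapper.class);
    // ...which spawns 4 threads inside ONE task JVM, each calling MyLineMapper.map().
    MultithreadedMapper.setMapperClass(job, MyLineMapper.class);
    MultithreadedMapper.setNumberOfThreads(job, 4);

    // By contrast, mapreduce.tasktracker.map.tasks.maximum is read by the
    // TaskTracker daemon (mapred-site.xml), not by the job: it caps how many
    // such task JVMs one node may run concurrently.

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}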
Re: Combining MultithreadedMapper threadpool size map.tasks.maximum
hi Harsh, On 10 February 2012 12:42, Harsh J ha...@cloudera.com wrote: 4 JVMs if you have 4 tasks in your Job (# of map tasks of a job is dependent on its input). Each JVM will then run the MultithreadedMapper code, which will then run 4 threads to call your map() inside of it cause you've asked that of it.

So... the MultithreadedMapper class splits *one* map task into N threads? How is this achieved? I wasn't aware that a map task could be implicitly sub-divided. I was under the (false?) impression that the purpose of MultithreadedMapper was to allow N independent map tasks to be forked as threads?

Also, from what you say, if you have map.tasks.maximum = 4 and setNumberOfThreads(4), then in all, for each compute node, up to 16 threads could be forked at any one time?

I'm trying to identify the performance penalty or benefit of achieving node concurrency with threads rather than with multiple JVMs, and I was hoping that by setting map.tasks.maximum = 1 and setNumberOfThreads( #cores ), I would achieve that. Maybe not? thanks, -- Rob
Re: Combining MultithreadedMapper threadpool size map.tasks.maximum
Rob, On Fri, Feb 10, 2012 at 6:32 PM, Rob Stewart robstewar...@gmail.com wrote: So.. the MultithreadedMapper class splits *one* map task into N number of threads? How is this achieved? I wasn't aware that a map task could be implicitly sub-divided implicitly? I was under the (false?) impression that the purpose of a MultithreadedMapper enabled the opportunity to send N number of independent map tasks to be forked as threads. ?

Imagine writing your own Mapper code that runs threads to do some processing when beginning the map() process. MultithreadedMapper is just an abstraction of something like that, provided for developer convenience. It has no relationship with tasks, task scheduling, or anything else higher up in the framework. Does that make it clear?

Also, from what you say.. if you have map.tasks.maximum = 4 and setNumberOfThreads(4), then in all, for each compute node, up to 16 threads could be forked at any one time?

Yeah, you'd be running, at maximum, 4 JVMs, each with 4 threads inside it.

I'm trying to identify the performance penalty or performance benefit of achieving node concurrency with threads, rather than multiple JVMs. I and I was hoping that setting map.tasks.maximum = 1, and setNumberOfThreads( #cores ), I would achieve that. Maybe not?

What you're missing here is that the multithreaded mapper is something that runs as part of one single map task. Each map task has a defined input split from which it reads off keys and values for map() calls. With just one JVM slot, you'd end up processing only one input-chunk at a time, though with 4 threads doing the map() computation, while with four slots you may be processing 4 input-chunks (4 tasks) at the same time. The choice between the two has to be application-sensitive. If your work were IO intensive, the slot approach would win at parallelism; using a single slot with 4 threads when the map() computation is cheap would be a waste of time you could instead spend doing more IO with parallel tasks. But if your work were more CPU intensive, where each map() may take a long time to run before moving to the next, then MTMapper with a set number of threads may make more sense. -- Harsh J Customer Ops. Engineer Cloudera | http://tiny.cloudera.com/about
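As an illustration of "writing your own Mapper code that runs threads", here is a rough, hypothetical sketch (the class name, pool size, and the expensiveComputation() placeholder are invented) of a hand-rolled mapper that pushes each record's processing onto a small thread pool. MultithreadedMapper packages the same idea up more generally; either way the framework still sees one ordinary map task.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HandRolledThreadedMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private ExecutorService pool;

  @Override
  protected void setup(Context context) {
    pool = Executors.newFixedThreadPool(4);   // threads inside this one task JVM
  }

  @Override
  protected void map(LongWritable key, Text value, final Context context) {
    // Copy the value: Hadoop reuses the Text object between map() calls,
    // so a worker thread must not hold on to it.
    final String record = value.toString();
    pool.submit(new Runnable() {
      public void run() {
        int result = expensiveComputation(record);
        try {
          // Context is not thread-safe; serialize writes to it.
          synchronized (context) {
            context.write(new Text(record), new IntWritable(result));
          }
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
      }
    });
  }

  @Override
  protected void cleanup(Context context) throws InterruptedException {
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS); // let queued records finish before the task ends
  }

  private int expensiveComputation(String record) {
    return record.length();                   // placeholder for the real CPU-heavy work
  }
}

Note that this simple version queues records without bound if map() is called faster than the pool drains; MultithreadedMapper instead has each of its threads pull records from the (synchronized) reader itself.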
Re: Combining MultithreadedMapper threadpool size map.tasks.maximum
Harsh, On 10 February 2012 13:33, Harsh J ha...@cloudera.com wrote: What you're missing to see here is that the multithreaded mapper is something that runs as part of one single map task. With just one JVM slot, you'd end up processing only one input-chunk at a time, though with 4 threads doing map() computation, while with four slots, you may be processing 4 input-chunks (4 tasks) at the same time. The choice between the two has to be application-sensitive.

OK, take word count. The k,v pair given to the map is (null, "foo bar lambda beta"). The canonical Hadoop program would tokenize this line of text and output ("foo", 1) and so on. How would the MultithreadedMapper know how to further divide this line of text into, say, [(null, "foo bar"), (null, "lambda beta")] for 2 threads to run in parallel? Can you somehow provide an additional record reader to split the input to the map task into sub-inputs for each thread?

If your work were IO intensive, the slot approach would win at parallelism.

Are you saying here that 4 single-threaded OS processes can achieve a higher rate of OS IO than 4 threads within one OS process doing IO (which would sound sensible if that's the case)?

Using single slot with 4 threads when the map() computation is cheap would be a waste of time you could instead do more IO with parallel tasks.

The argument against this approach is that starting up OS processes is far more expensive than forking threads within processes. So I would have said the contrary - where map tasks are small and the input size is large, then many JVMs would be instantiated throughout the system, one per task. Instead, one might speculate that reducing the number of JVMs, replacing them with lower-latency thread forking, would improve runtime speeds?

But if your work were more CPU intensive, where each map() may take a long time to run before moving to next, then MTMapper with a set amount of threads may make more sense to use.

OK, so are you saying:
- For CPU intensive tasks, multiple threads might help
- For IO intensive tasks, multiple OS processes achieve higher throughput than multiple threads within a smaller number of OS processes?
Thanks, -- Rob
Re: Combining MultithreadedMapper threadpool size map.tasks.maximum
Hello again, On Fri, Feb 10, 2012 at 7:31 PM, Rob Stewart robstewar...@gmail.com wrote: OK, take word count. The k,v to the map is null,foo bar lambda beta. The canonical Hadoop program would tokenize this line of text and output foo,1 and so on. How would the multithreadedmapper know how to further divide this line of text into, say: [null,foo bar,null,lambda beta] for 2 threads to run in parallel? Can you somehow provide an additional record reader to split the input to the map task into sub-inputs for each thread?

In MultithreadedMapper, the IO work is still single threaded, while the map() calling post-read is multithreaded. But yes, you could use a mix of CombineFileInputFormat and some custom logic to have multiple local splits per map task, and divide readers of them among your threads. But why do all this when that's what slots at the TT are for? The cost of a single map task failure with your mammoth-task approach would also be higher - more work to repeat.

Are you saying here that 4 single-threaded OS processes can achieve a higher rate of OS IO, than 4 threads within one OS process doing IO (which would sound sensible if that's the case).

Yeah, that's what I meant, but with the earlier point of "In MultithreadedMapper, the IO work is still single threaded" specifically in mind.

The argument against this approach is that the cost starting up OS processes is far more expensive that forking threads within processes. So I would have said the contrary - where map tasks are small and input size is large, than many JVMs would be instantiated throughout the system, one per task. Instead, one might speculate that reducing the number of JVMs, replacing with lower latency thread forking would improve runtime speeds. ?

Agreed here. The JVM startup overhead does exist, but I wouldn't think it's too high a cost overall, given the simple benefits it can provide instead. There is also JVM reuse, which makes sense for CPU intensive applications, so you can take advantage of the HotSpot features of the JVM as it gets reused for running tasks of the same job.

OK, so are you saying: - For CPU intensive tasks, multiple threads might help - For IO intensive tasks, multiple OS processes achieve higher throughput than multiple threads within a smaller number of OS processes?

Yep, but also if you limit your total slots to 1 in favor of going all in for multi-threading, you won't be able to smoothly run multiple jobs at the same time. Tasks from new jobs may have to wait longer to run, while in regular slotted environments this is easier to achieve. -- Harsh J Customer Ops. Engineer Cloudera | http://tiny.cloudera.com/about
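On the JVM-reuse point, here is a small sketch of the knob as it exists in the old (org.apache.hadoop.mapred) API of the 0.20/1.0 line; the class name is made up, and the final print is only there to show which raw property the setter writes.

import org.apache.hadoop.mapred.JobConf;

public class JvmReuseExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // 1 (the default) = a fresh JVM per task; N = reuse a JVM for up to N
    // tasks of the same job; -1 = unlimited reuse, so HotSpot-compiled code
    // survives across the job's tasks.
    conf.setNumTasksToExecutePerJvm(-1);
    // The underlying property the setter writes:
    System.out.println("mapred.job.reuse.jvm.num.tasks = "
        + conf.get("mapred.job.reuse.jvm.num.tasks"));
  }
}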
Hadoop 0.21.0 streaming giving no status information
Hi, I'm trying to upgrade an application previously written for Hadoop 0.20.0 to 0.21.0. I'm running into an issue where the status output is missing, which makes it difficult to get the job id and success status:

hadoop/bin/hadoop jar hadoop/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar -D mapreduce.job.reduces=0 -input file:///dev/null -mapper ./cmd.sh -file ./cmd.sh -output '/users/foo/tmp/job-1234' -verbose > OUTPUT 2>&1

This gives me a bunch of settings output such as:

STREAM: net.topology.script.number.args=100
STREAM: s3.blocksize=67108864
STREAM: s3.bytes-per-checksum=512
STREAM: s3.client-write-packet-size=65536
STREAM: s3.replication=3
STREAM: s3.stream-buffer-size=4096

finally ending with:

STREAM: webinterface.private.actions=false
STREAM:
STREAM: submitting to jobconf: machine.hostname.domain:8023

After that, I get no further status information. The job does complete successfully. I would expect to get this type of status information:

11/04/23 01:03:24 INFO streaming.StreamJob: getLocalDirs(): [/home/hadoop/hadoop/tmp/dir/hadoop-hadoop/mapred/local]
11/04/23 01:03:24 INFO streaming.StreamJob: Running job: job_201104222325_0021
11/04/23 01:03:24 INFO streaming.StreamJob: To kill this job, run:
11/04/23 01:03:24 INFO streaming.StreamJob: /home/hadoop/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201104222325_0021
11/04/23 01:03:24 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201104222325_0021
11/04/23 01:03:25 INFO streaming.StreamJob: map 0% reduce 0%
11/04/23 01:03:31 INFO streaming.StreamJob: map 50% reduce 0%
11/04/23 01:03:41 INFO streaming.StreamJob: map 50% reduce 17%
11/04/23 01:03:56 INFO streaming.StreamJob: map 100% reduce 100%

I've tried playing with various switches, including:

-Dhadoop.root.logger=INFO,console
-Dhadoop.log.file=hadoop.log
-Dhadoop.log.dir=$PWD

but none of these make a difference. Any help would be greatly appreciated! -- - Patrick Donnelly
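One way to get the job id and success status despite the missing console output is to ask the JobTracker about the job directly, for example after looking its id up in the JobTracker web UI (port 50030). Below is a rough sketch using the old-API JobClient; the class name and the way the job id is passed in are assumptions for illustration, and it presumes mapred-site.xml (with the JobTracker address) is on the classpath.

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

public class JobStatusProbe {
  public static void main(String[] args) throws Exception {
    // args[0] is a job id such as job_201104222325_0021
    JobConf conf = new JobConf();                 // reads mapred-site.xml for the JobTracker address
    JobClient client = new JobClient(conf);
    RunningJob job = client.getJob(JobID.forName(args[0]));
    if (job == null) {
      System.err.println("No such job: " + args[0]);
      return;
    }
    System.out.println("Tracking URL:    " + job.getTrackingURL());
    System.out.println("Map progress:    " + job.mapProgress());
    System.out.println("Reduce progress: " + job.reduceProgress());
    System.out.println("Complete: " + job.isComplete()
        + ", successful: " + (job.isComplete() && job.isSuccessful()));
  }
}

The same information is also available from the command line via hadoop job -status <job-id>.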
Re: Are Hadoop 0.20.205 and Ganglia 3.1.7 compatible with each other?
Hey Merto, Any luck getting the patch running on your cluster? In case you're interested, there's now a JIRA for this: https://issues.apache.org/jira/browse/HADOOP-8052. Varun On Wed, Feb 8, 2012 at 7:45 PM, Varun Kapoor rez...@hortonworks.com wrote: Your general procedure sounds correct (i.e. dropping your newly built .jar into $HD_HOME/lib/), but to make sure it's getting picked up, you should explicitly add $HD_HOME/lib/ to your exported HADOOP_CLASSPATH environment variable; here's mine, as an example: export HADOOP_CLASSPATH=.:./build/*.jar About your second point, you certainly need to copy this newly patched .jar to every node in your cluster, because my patch changes the value of a couple metrics emitted TO gmetad (FROM all the nodes in the cluster), so without copying it over to every node in the cluster, gmetad will still likely receive some bad metrics. Varun On Wed, Feb 8, 2012 at 6:19 PM, Merto Mertek masmer...@gmail.com wrote: I will need your help. Please confirm if the following procedure is right. I have a dev environment where I pimp my scheduler (no hadoop running) and a small cluster environment where the changes(jars) are deployed with some scripts, however I have never compiled the whole hadoop from source so I do not know if I am doing it right. I' ve done it as follow: a) apply a patch b) cd $HD_HOME; ant c) copy $HD_HOME/*build*/patched-core-hadoop.jar - cluster:/$HD_HOME/*lib* d) run $HD_HOME/bin/start-all.sh Is this enough? When I tried to test hadoop dfs -ls / I could see that a new jar was not loaded and instead a jar from $HD_HOME/*share*/hadoop-20.205.0.jar was taken.. Should I copy the entire hadoop folder to all nodes and reconfigure the entire cluster for the new build, or is enough if I configure it just on the node where gmetad will run? On 8 February 2012 06:33, Varun Kapoor rez...@hortonworks.com wrote: I'm so sorry, Merto - like a silly goose, I attached the 2 patches to my reply, and of course the mailing list did not accept the attachment. I plan on opening JIRAs for this tomorrow, but till then, here are links to the 2 patches (from my Dropbox account): - http://dl.dropbox.com/u/4366344/gmetadBufferOverflow.Hadoop.patch - http://dl.dropbox.com/u/4366344/gmetadBufferOverflow.gmetad.patch Here's hoping this works for you, Varun On Tue, Feb 7, 2012 at 6:00 PM, Merto Mertek masmer...@gmail.com wrote: Varun, have I missed your link to the patches? I have tried to search them on jira but I did not find them.. Can you repost the link for these two patches? Thank you.. On 7 February 2012 20:36, Varun Kapoor rez...@hortonworks.com wrote: I'm sorry to hear that gmetad cores continuously for you guys. Since I'm not seeing that behavior, I'm going to just put out the 2 possible patches you could apply and wait to hear back from you. :) Option 1 * Apply gmetadBufferOverflow.Hadoop.patch to the relevant file ( http://svn.apache.org/viewvc/hadoop/common/branches/branch-1/src/core/org/apache/hadoop/metrics2/util/SampleStat.java?view=markupinmysetup) in your Hadoop sources and rebuild Hadoop. Option 2 * Apply gmetadBufferOverflow.gmetad.patch to gmetad/process_xml.c and rebuild gmetad. Only 1 of these 2 fixes is required, and it would help me if you could first try Option 1 and let me know if that fixes things for you. Varun On Mon, Feb 6, 2012 at 10:36 PM, mete efk...@gmail.com wrote: Same with Merto's situation here, it always overflows short time after the restart. Without the hadoop metrics enabled everything is smooth. 
Regards, Mete

On Tue, Feb 7, 2012 at 4:58 AM, Merto Mertek masmer...@gmail.com wrote: I have tried to run it but it repeats crashing..
- When you start gmetad and Hadoop is not emitting metrics, everything is peachy.
Right, running just ganglia without running hadoop jobs seems stable for at least a day..
- When you start Hadoop (and it thus starts emitting metrics), gmetad cores.
True, with a following error:
*** stack smashing detected ***: gmetad terminated
Segmentation fault
- On my MacBookPro, it's a SIGABRT due to a buffer overflow. I believe this is happening for everyone. What I would like for you to try out are the following 2 scenarios:
- Once gmetad cores, if you start it up again, does it core again? Does this process repeat ad infinitum?
- On my MBP, the core is a one-time thing, and restarting gmetad after the first core makes things run perfectly smoothly.
- I know others are saying this core occurs continuously, but they were all using ganglia-3.1.x, and I'm
Re: Combining MultithreadedMapper threadpool size map.tasks.maximum
Harsh... Oddly, this blog post has appeared within the last hour or so http://kickstarthadoop.blogspot.com/2012/02/enable-multiple-threads-in-mapper-aka.html -- Rob On 10 February 2012 14:20, Harsh J ha...@cloudera.com wrote: Hello again, On Fri, Feb 10, 2012 at 7:31 PM, Rob Stewart robstewar...@gmail.com wrote: OK, take word count. The k,v to the map is null,foo bar lambda beta. The canonical Hadoop program would tokenize this line of text and output foo,1 and so on. How would the multithreadedmapper know how to further divide this line of text into, say: [null,foo bar,null,lambda beta] for 2 threads to run in parallel? Can you somehow provide an additional record reader to split the input to the map task into sub-inputs for each thread? In MultithreadedMapper, the IO work is still single threaded, while the map() calling post-read is multithreaded. But yes you could use a mix of CombineFileInputFormat and some custom logic to have multiple local splits per map task, and divide readers of them among your threads. But why do all this when thats what slots at the TT are for? The cost of a single map task failure with your mammoth task approach would also be higher - more work to repeat. Are you saying here that 4 single-threaded OS processes can achieve a higher rate of OS IO, than 4 threads within one OS process doing IO (which would sound sensible if that's the case). Yeah thats what I meant, but with the earlier point of In MultithreadedMapper, the IO work is still single threaded specifically in mind. The argument against this approach is that the cost starting up OS processes is far more expensive that forking threads within processes. So I would have said the contrary - where map tasks are small and input size is large, than many JVMs would be instantiated throughout the system, one per task. Instead, one might speculate that reducing the number of JVMs, replacing with lower latency thread forking would improve runtime speeds. ? Agreed here. The JVM startup overhead does exist but I wouldn't think its too high a cost overall, given the simple benefits it can provide instead. There is also JVM reuse which makes sense to use for CPU intensive applications, so you can take advantage of the HotSpot features of the JVM as it gets reused for running tasks of the same job. OK, so are you saying: - For CPU intensive tasks, multiple threads might help - For IO intensive tasks, multiple OS processes achieve higher throughput than multiple threads within a smaller number of OS processes? Yep, but also if you limit your total slots to 1 in favor of going all for multi-threading, you won't be able to smoothly run multiple jobs at the same time. Tasks from new jobs may have to wait longer to run, while in regular slotted environments this is easier to achieve. -- Harsh J Customer Ops. Engineer Cloudera | http://tiny.cloudera.com/about
Re: Combining MultithreadedMapper threadpool size map.tasks.maximum
Thanks, this is a lot clearer. One final question... On 10 February 2012 14:20, Harsh J ha...@cloudera.com wrote: Hello again, On Fri, Feb 10, 2012 at 7:31 PM, Rob Stewart robstewar...@gmail.com wrote: OK, take word count. The k,v to the map is null,foo bar lambda beta. The canonical Hadoop program would tokenize this line of text and output foo,1 and so on. How would the multithreadedmapper know how to further divide this line of text into, say: [null,foo bar,null,lambda beta] for 2 threads to run in parallel? Can you somehow provide an additional record reader to split the input to the map task into sub-inputs for each thread? In MultithreadedMapper, the IO work is still single threaded, while the map() calling post-read is multithreaded. But yes you could use a mix of CombineFileInputFormat and some custom logic to have multiple local splits per map task, and divide readers of them among your threads. But why do all this when thats what slots at the TT are for?

I'm still unsure how the multi-threaded mapper knows how to split the input value into chunks, one chunk for each thread. There is only one example in the Hadoop 0.23 trunk: hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/mapreduce/lib/map/TestMultithreadedMapper.java, and in that source code there is no custom logic for local splits per map task at all.

Again, going back to the word count example: given a line of text as input to a map which comprises 6 words, if I specify .setNumberOfThreads( 2 ), then ideally I'd want 3 words analysed by one thread and the other 3 by the other. Is that what would happen? i.e. I'm unsure whether the MultithreadedMapper class does the splitting of inputs to map tasks... Regards,
Fwd: HELP - Problem in setting up Hadoop - Multi-Node Cluster
Dear Robin, Thanks for your valuable time and response. Please find attached the namenode logs and configuration files. I am using 2 Ubuntu boxes, one as master & slave and the other as slave. Below is the environment set up on both machines:
Hadoop: hadoop_0.20.2
Linux: Ubuntu Linux 10.10 (master) and Ubuntu Linux 11.04 (slave)
Java: java-7-oracle
JAVA_HOME and HADOOP_HOME configuration is done in the .bashrc file. Both machines are on the LAN and able to ping each other. The IP addresses of both machines are configured in /etc/hosts. I do have SSH access to both master and slave as well. Please let me know if you need any other information. Thanks in advance. Regards, Guruprasad

On Thu, Feb 9, 2012 at 1:06 AM, Robin Mueller-Bady robin.mueller-b...@oracle.com wrote: Dear Guruprasad, it would be very helpful to provide details from your configuration files as well as more details on your setup. It seems that the connection from slave to master cannot be established (Connection reset by peer). Do you use a virtual environment, physical master/slaves, or all on one machine? Please paste also the output of the kingul2 namenode logs. Regards, Robin

On 02/08/12 13:06, Guruprasad B wrote: Hi, I am Guruprasad from Bangalore (India). I need help in setting up the Hadoop platform; I am very much new to it. I followed the below article and was able to set up a Single-Node Cluster: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/#what-we-want-to-do Now I am trying to set up a Multi-Node Cluster by following this article: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ Below given is my setup:
Hadoop: hadoop_0.20.2
Linux: Ubuntu Linux 10.10
Java: java-7-oracle
I have successfully reached the topic "Starting the multi-node cluster" in the above article.
When I start the HDFS/MapReduce daemons it is getting started and going down immediately both in master & slave as well, please have a look at the below logs,

hduser@kinigul2:/usr/local/hadoop$ bin/start-dfs.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-kinigul2.out
master: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-kinigul2.out
slave: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-guruL.out
master: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-kinigul2.out

hduser@kinigul2:/usr/local/hadoop$ jps
6098 DataNode
6328 Jps
5914 NameNode
6276 SecondaryNameNode

hduser@kinigul2:/usr/local/hadoop$ jps
6350 Jps

I am getting below given error in slave logs:

2012-02-08 21:04:01,641 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Call to master/16.150.98.62:54310 failed on local exception: java.io.IOException: Connection reset by peer
at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
at org.apache.hadoop.ipc.Client.call(Client.java:743)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy4.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:346)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:383)
at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:314)
at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:291)
at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:269)
at org.apache.hadoop.hdfs.server.datanode.DataNode.init(DataNode.java:216)
at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:218)
at sun.nio.ch.IOUtil.read(IOUtil.java:191)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:359)
at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at
Re: Combining MultithreadedMapper threadpool size map.tasks.maximum
Hi Rob I'm the culprit who posted the blog. :) The topic was of my interest as well and I found the conversation informative and useful. Just thought of documenting the same as it could be useful for others as well in future. Hope you don't mind!.. Regards Bejoy K S From handheld, Please excuse typos. -Original Message- From: Rob Stewart robstewar...@gmail.com Date: Fri, 10 Feb 2012 18:30:53 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: Re: Combining MultithreadedMapper threadpool size map.tasks.maximum Harsh... Oddly, this blog post has appeared within the last hour or so http://kickstarthadoop.blogspot.com/2012/02/enable-multiple-threads-in-mapper-aka.html -- Rob On 10 February 2012 14:20, Harsh J ha...@cloudera.com wrote: Hello again, On Fri, Feb 10, 2012 at 7:31 PM, Rob Stewart robstewar...@gmail.com wrote: OK, take word count. The k,v to the map is null,foo bar lambda beta. The canonical Hadoop program would tokenize this line of text and output foo,1 and so on. How would the multithreadedmapper know how to further divide this line of text into, say: [null,foo bar,null,lambda beta] for 2 threads to run in parallel? Can you somehow provide an additional record reader to split the input to the map task into sub-inputs for each thread? In MultithreadedMapper, the IO work is still single threaded, while the map() calling post-read is multithreaded. But yes you could use a mix of CombineFileInputFormat and some custom logic to have multiple local splits per map task, and divide readers of them among your threads. But why do all this when thats what slots at the TT are for? The cost of a single map task failure with your mammoth task approach would also be higher - more work to repeat. Are you saying here that 4 single-threaded OS processes can achieve a higher rate of OS IO, than 4 threads within one OS process doing IO (which would sound sensible if that's the case). Yeah thats what I meant, but with the earlier point of In MultithreadedMapper, the IO work is still single threaded specifically in mind. The argument against this approach is that the cost starting up OS processes is far more expensive that forking threads within processes. So I would have said the contrary - where map tasks are small and input size is large, than many JVMs would be instantiated throughout the system, one per task. Instead, one might speculate that reducing the number of JVMs, replacing with lower latency thread forking would improve runtime speeds. ? Agreed here. The JVM startup overhead does exist but I wouldn't think its too high a cost overall, given the simple benefits it can provide instead. There is also JVM reuse which makes sense to use for CPU intensive applications, so you can take advantage of the HotSpot features of the JVM as it gets reused for running tasks of the same job. OK, so are you saying: - For CPU intensive tasks, multiple threads might help - For IO intensive tasks, multiple OS processes achieve higher throughput than multiple threads within a smaller number of OS processes? Yep, but also if you limit your total slots to 1 in favor of going all for multi-threading, you won't be able to smoothly run multiple jobs at the same time. Tasks from new jobs may have to wait longer to run, while in regular slotted environments this is easier to achieve. -- Harsh J Customer Ops. Engineer Cloudera | http://tiny.cloudera.com/about
Re: HELP - Problem in setting up Hadoop - Multi-Node Cluster
Dear Robin, Yes, it is possible. Regards, Guru On Fri, Feb 10, 2012 at 1:23 PM, Robin Mueller-Bady robin.mueller-b...@oracle.com wrote: Dear Guruprasad, is it possible to ping both machines with their hostnames ? (ping master / ping slave) ? Regards, Robin On 10.02.2012 07:46, Guruprasad B wrote: Dear Robin, Thanks for your valuable time and response. please find the attached namenode logs and configurations files. I am using 2 ubuntu boxes.One as master slave and other as slave. below given is the environment set-up in both the machines. : Hadoop : hadoop_0.20.2 Linux: Ubuntu Linux 10.10(master) and Ubuntu Linux 11.04(Slave) Java: java-7-oracle JAVA_HOME and HADOOP_HOME configuration is done in .bashrc file. Both the machines are in LAN and able to ping each other. IP address's of both the machines are configured in /etc/hosts. I do have SSH access to both master and slave as well. please let me know if you need any other information. Thanks in advance. Regards, Guruprasad On Thu, Feb 9, 2012 at 1:06 AM, Robin Mueller-Bady robin.mueller-b...@oracle.com wrote: Dear Guruprasad, it would be very helpful to provide details from your configuration files as well as more details on your setup. It seems to be that the connection from slave to master cannot be established (Connection reset by peer). Do you use a virtual environment, physical master/slaves or all on one machine ? Please paste also the output of kingul2 namenode logs. Regards, Robin On 02/08/12 13:06, Guruprasad B wrote: Hi, I am Guruprasad from Bangalore (India). I need help in setting up hadoop platform. I am very much new to Hadoop Platform. I am following the below given articles and I was able to set up Single-Node Cluster http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/#what-we-want-to-do Now I am trying to set up http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/#what-we-want-to-doNowIamtryingtosetupMulti-Node Cluster by following the below given article.http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ Below given is my setup: Hadoop : hadoop_0.20.2 Linux: Ubuntu Linux 10.10 Java: java-7-oracle I have successfully reached till the topic Starting the multi-node cluster in the above given article. 
When I start the HDFS/MapReduce daemons it is getting started and going down immediately both in master slave as well, please have a look at the below logs, hduser@kinigul2:/usr/local/hadoop$ bin/start-dfs.sh starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-kinigul2.out master: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-kinigul2.out slave: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-guruL.out master: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-kinigul2.out hduser@kinigul2:/usr/local/hadoop$ jps 6098 DataNode 6328 Jps 5914 NameNode 6276 SecondaryNameNode hduser@kinigul2:/usr/local/hadoop$ jps 6350 Jps I am getting below given error in slave logs: 2012-02-08 21:04:01,641 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Call to master/16.150.98.62:54310 failed on local exception: java.io.IOException: Connection reset by peer at org.apache.hadoop.ipc.Client.wrapException(Client.java:775) at org.apache.hadoop.ipc.Client.call(Client.java:743) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy4.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:346) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:383) at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:314) at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:291) at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:269) at org.apache.hadoop.hdfs.server.datanode.DataNode.init(DataNode.java:216) at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283) at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238) at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246) at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368) Caused by: java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:218) at sun.nio.ch.IOUtil.read(IOUtil.java:191) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:359) at
Re: Combining MultithreadedMapper threadpool size map.tasks.maximum
Hi Rob I'd try to answer this. From my understanding if you are using Multithreaded mapper on word count example with TextInputFormat and imagine you have 2 threads and 2 lines in your input split . RecordReader would read Line 1 and give it to map thread 1 and line 2 to map thread 2. So kind of identical process as defined would be happening with these two lines in parallel. This would be the default behavior. Regards Bejoy K S From handheld, Please excuse typos. -Original Message- From: Rob Stewart robstewar...@gmail.com Date: Fri, 10 Feb 2012 18:39:44 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: Re: Combining MultithreadedMapper threadpool size map.tasks.maximum Thanks, this is a lot clearer. One final question... On 10 February 2012 14:20, Harsh J ha...@cloudera.com wrote: Hello again, On Fri, Feb 10, 2012 at 7:31 PM, Rob Stewart robstewar...@gmail.com wrote: OK, take word count. The k,v to the map is null,foo bar lambda beta. The canonical Hadoop program would tokenize this line of text and output foo,1 and so on. How would the multithreadedmapper know how to further divide this line of text into, say: [null,foo bar,null,lambda beta] for 2 threads to run in parallel? Can you somehow provide an additional record reader to split the input to the map task into sub-inputs for each thread? In MultithreadedMapper, the IO work is still single threaded, while the map() calling post-read is multithreaded. But yes you could use a mix of CombineFileInputFormat and some custom logic to have multiple local splits per map task, and divide readers of them among your threads. But why do all this when thats what slots at the TT are for? I'm still unsure how the multi-threaded mapper knows how to split the input value into chunks, one chunk for each thread. There is only one example in the Hadoop 0.23 trunk that offers an example: hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/mapreduce/lib/map/TestMultithreadedMapper.java And in that source code, there is no custom logic for local splits per map task at all. Again, going back to the word count example. Given a line of text as input to a map, which comprises of 6 words. I specificy .setNumberOfThreads( 2 ), so ideally, I'd want 3 words analysed by one thread, and the 3 to the other. Is what what would happen? i.e. - I'm unsure whether the multithreadedmapper class does the splitting of inputs to map tasks... Regards,
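To illustrate Bejoy's description: the record reader stays single-threaded and hands each whole line to whichever mapper thread is free, and the map() body is the completely ordinary word-count mapper; the line itself is not sub-divided. A hypothetical sketch (class names invented), which would be plugged in via MultithreadedMapper as in the driver sketch earlier in this thread:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Unchanged word-count mapper: each map() call still receives one whole line;
// with MultithreadedMapper, different lines are simply handled by different
// threads. This is the class you would pass to
// MultithreadedMapper.setMapperClass(job, TokenizerMapper.class).
public class TokenizerMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);
    }
  }
}

Each runner thread gets its own TokenizerMapper instance, so the reused word field is not shared across threads.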
Where Is DataJoinMapperBase?
Hi, all, I am starting to learn advanced Map/Reduce. However, I cannot find the class DataJoinMapperBase in my downloaded Hadoop 1.0.0 and 0.20.2. So I searched the Web and got the following link: http://www.java2s.com/Code/Jar/h/Downloadhadoop0201datajoinjar.htm From the link I got the package hadoop-0.20.1-datajoin.jar. My question is why the package is not included in Hadoop 1.0.0 and 0.20.2? Is this the correct way to get it? Thanks so much! Best regards, Bing
Re: Combining MultithreadedMapper threadpool size map.tasks.maximum
Here is what I understand The RecordReader for the MTMappert takes the input split and cycles the records among the available threads. It also ensures that the map outputs are synchronized. So what Bejoy says is what will happen for the wordcount program. Raj From: bejoy.had...@gmail.com bejoy.had...@gmail.com To: common-user@hadoop.apache.org Sent: Friday, February 10, 2012 11:15 AM Subject: Re: Combining MultithreadedMapper threadpool size map.tasks.maximum Hi Rob I'd try to answer this. From my understanding if you are using Multithreaded mapper on word count example with TextInputFormat and imagine you have 2 threads and 2 lines in your input split . RecordReader would read Line 1 and give it to map thread 1 and line 2 to map thread 2. So kind of identical process as defined would be happening with these two lines in parallel. This would be the default behavior. Regards Bejoy K S From handheld, Please excuse typos. -Original Message- From: Rob Stewart robstewar...@gmail.com Date: Fri, 10 Feb 2012 18:39:44 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: Re: Combining MultithreadedMapper threadpool size map.tasks.maximum Thanks, this is a lot clearer. One final question... On 10 February 2012 14:20, Harsh J ha...@cloudera.com wrote: Hello again, On Fri, Feb 10, 2012 at 7:31 PM, Rob Stewart robstewar...@gmail.com wrote: OK, take word count. The k,v to the map is null,foo bar lambda beta. The canonical Hadoop program would tokenize this line of text and output foo,1 and so on. How would the multithreadedmapper know how to further divide this line of text into, say: [null,foo bar,null,lambda beta] for 2 threads to run in parallel? Can you somehow provide an additional record reader to split the input to the map task into sub-inputs for each thread? In MultithreadedMapper, the IO work is still single threaded, while the map() calling post-read is multithreaded. But yes you could use a mix of CombineFileInputFormat and some custom logic to have multiple local splits per map task, and divide readers of them among your threads. But why do all this when thats what slots at the TT are for? I'm still unsure how the multi-threaded mapper knows how to split the input value into chunks, one chunk for each thread. There is only one example in the Hadoop 0.23 trunk that offers an example: hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/mapreduce/lib/map/TestMultithreadedMapper.java And in that source code, there is no custom logic for local splits per map task at all. Again, going back to the word count example. Given a line of text as input to a map, which comprises of 6 words. I specificy .setNumberOfThreads( 2 ), so ideally, I'd want 3 words analysed by one thread, and the 3 to the other. Is what what would happen? i.e. - I'm unsure whether the multithreadedmapper class does the splitting of inputs to map tasks... Regards,
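Raj's description can be mimicked in roughly thirty lines of plain Java. This is an illustration of the pattern, not Hadoop's actual source: the shared iterator plays the role of the record reader, reads and writes are serialized, and only the per-record work runs in parallel.

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class CycledRecordsDemo {
  public static void main(String[] args) throws InterruptedException {
    List<String> split = Arrays.asList(
        "foo bar lambda beta", "alpha beta gamma", "foo foo bar");
    final Iterator<String> reader = split.iterator();  // stands in for the RecordReader
    final Object readLock = new Object();
    final Object writeLock = new Object();

    Thread[] workers = new Thread[2];
    for (int t = 0; t < workers.length; t++) {
      final int id = t;
      workers[t] = new Thread(new Runnable() {
        public void run() {
          while (true) {
            String line;
            synchronized (readLock) {            // reads are single-threaded
              if (!reader.hasNext()) return;
              line = reader.next();
            }
            // the "map()" work runs in parallel: count the words in this record
            int words = line.split("\\s+").length;
            synchronized (writeLock) {           // writes are synchronized too
              System.out.println("thread-" + id + ": <" + line + "> -> " + words + " words");
            }
          }
        }
      });
      workers[t].start();
    }
    for (Thread w : workers) w.join();
  }
}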
Re: Are Hadoop 0.20.205 and Ganglia 3.1.7 compatible with each other?
Varun unfortunately I have had some problems with deploying a new version on the cluster.. Hadoop is not picking the new build in lib folder despite a classpath is set to it. The new build is picked just if I put it in the $HD_HOME/share/hadoop/, which is very strange.. I've done this on all nodes and can access the web, but all tasktracker are being stopped because of an error: INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Cleanup... java.lang.InterruptedException: sleep interrupted at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.filecache.TrackerDistributedCacheManager$CleanupThread.run(TrackerDistributedCacheManager.java:926) Probably the error is the consequence of an inadequate deploy of a jar.. I will ask to the dev list how they do it or are you maybe having any other idea? On 10 February 2012 17:10, Varun Kapoor rez...@hortonworks.com wrote: Hey Merto, Any luck getting the patch running on your cluster? In case you're interested, there's now a JIRA for this: https://issues.apache.org/jira/browse/HADOOP-8052. Varun On Wed, Feb 8, 2012 at 7:45 PM, Varun Kapoor rez...@hortonworks.com wrote: Your general procedure sounds correct (i.e. dropping your newly built .jar into $HD_HOME/lib/), but to make sure it's getting picked up, you should explicitly add $HD_HOME/lib/ to your exported HADOOP_CLASSPATH environment variable; here's mine, as an example: export HADOOP_CLASSPATH=.:./build/*.jar About your second point, you certainly need to copy this newly patched .jar to every node in your cluster, because my patch changes the value of a couple metrics emitted TO gmetad (FROM all the nodes in the cluster), so without copying it over to every node in the cluster, gmetad will still likely receive some bad metrics. Varun On Wed, Feb 8, 2012 at 6:19 PM, Merto Mertek masmer...@gmail.com wrote: I will need your help. Please confirm if the following procedure is right. I have a dev environment where I pimp my scheduler (no hadoop running) and a small cluster environment where the changes(jars) are deployed with some scripts, however I have never compiled the whole hadoop from source so I do not know if I am doing it right. I' ve done it as follow: a) apply a patch b) cd $HD_HOME; ant c) copy $HD_HOME/*build*/patched-core-hadoop.jar - cluster:/$HD_HOME/*lib* d) run $HD_HOME/bin/start-all.sh Is this enough? When I tried to test hadoop dfs -ls / I could see that a new jar was not loaded and instead a jar from $HD_HOME/*share*/hadoop-20.205.0.jar was taken.. Should I copy the entire hadoop folder to all nodes and reconfigure the entire cluster for the new build, or is enough if I configure it just on the node where gmetad will run? On 8 February 2012 06:33, Varun Kapoor rez...@hortonworks.com wrote: I'm so sorry, Merto - like a silly goose, I attached the 2 patches to my reply, and of course the mailing list did not accept the attachment. I plan on opening JIRAs for this tomorrow, but till then, here are links to the 2 patches (from my Dropbox account): - http://dl.dropbox.com/u/4366344/gmetadBufferOverflow.Hadoop.patch - http://dl.dropbox.com/u/4366344/gmetadBufferOverflow.gmetad.patch Here's hoping this works for you, Varun On Tue, Feb 7, 2012 at 6:00 PM, Merto Mertek masmer...@gmail.com wrote: Varun, have I missed your link to the patches? I have tried to search them on jira but I did not find them.. Can you repost the link for these two patches? Thank you.. 
On 7 February 2012 20:36, Varun Kapoor rez...@hortonworks.com wrote: I'm sorry to hear that gmetad cores continuously for you guys. Since I'm not seeing that behavior, I'm going to just put out the 2 possible patches you could apply and wait to hear back from you. :) Option 1 * Apply gmetadBufferOverflow.Hadoop.patch to the relevant file ( http://svn.apache.org/viewvc/hadoop/common/branches/branch-1/src/core/org/apache/hadoop/metrics2/util/SampleStat.java?view=markupinmysetup ) in your Hadoop sources and rebuild Hadoop. Option 2 * Apply gmetadBufferOverflow.gmetad.patch to gmetad/process_xml.c and rebuild gmetad. Only 1 of these 2 fixes is required, and it would help me if you could first try Option 1 and let me know if that fixes things for you. Varun On Mon, Feb 6, 2012 at 10:36 PM, mete efk...@gmail.com wrote: Same with Merto's situation here, it always overflows short time after the restart. Without the hadoop metrics enabled everything is smooth. Regards Mete On Tue, Feb 7, 2012 at 4:58 AM, Merto Mertek masmer...@gmail.com wrote: I have tried to run it but it repeats crashing.. - When you start gmetad and
Re: HELP - Problem in setting up Hadoop - Multi-Node Cluster
Hi, Is your datanode initially able to connect to Namenode? Have you disabled all the firewalls related services? Do you see any errors at the startup log of Namenode or Datanode? I have dealt with similar kind of this problem earlier. So here is what you can try to do: First, test that ssh is working fine to ensure network is working fine. Ssh into slave from master and ssh into master from the same slave. Leave the ssh session open for as long as u can. In my case when I did the above experiment the ssh session was dropping so I got to know that it's a network related problem. It has got nothing to do with Hadoop. This post might be helpful for u: https://groups.google.com/a/cloudera.org/group/cdh-user/browse_thread/thread/4165f39d8b0bbc56 Best Regards, Anil On Fri, Feb 10, 2012 at 1:43 AM, Guruprasad B guruprasadk...@gmail.comwrote: Dear Robin, Yes, it is possible. Regards, Guru On Fri, Feb 10, 2012 at 1:23 PM, Robin Mueller-Bady robin.mueller-b...@oracle.com wrote: Dear Guruprasad, is it possible to ping both machines with their hostnames ? (ping master / ping slave) ? Regards, Robin On 10.02.2012 07:46, Guruprasad B wrote: Dear Robin, Thanks for your valuable time and response. please find the attached namenode logs and configurations files. I am using 2 ubuntu boxes.One as master slave and other as slave. below given is the environment set-up in both the machines. : Hadoop : hadoop_0.20.2 Linux: Ubuntu Linux 10.10(master) and Ubuntu Linux 11.04(Slave) Java: java-7-oracle JAVA_HOME and HADOOP_HOME configuration is done in .bashrc file. Both the machines are in LAN and able to ping each other. IP address's of both the machines are configured in /etc/hosts. I do have SSH access to both master and slave as well. please let me know if you need any other information. Thanks in advance. Regards, Guruprasad On Thu, Feb 9, 2012 at 1:06 AM, Robin Mueller-Bady robin.mueller-b...@oracle.com wrote: Dear Guruprasad, it would be very helpful to provide details from your configuration files as well as more details on your setup. It seems to be that the connection from slave to master cannot be established (Connection reset by peer). Do you use a virtual environment, physical master/slaves or all on one machine ? Please paste also the output of kingul2 namenode logs. Regards, Robin On 02/08/12 13:06, Guruprasad B wrote: Hi, I am Guruprasad from Bangalore (India). I need help in setting up hadoop platform. I am very much new to Hadoop Platform. I am following the below given articles and I was able to set up Single-Node Cluster http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/#what-we-want-to-do Now I am trying to set up http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/#what-we-want-to-doNowIamtryingtosetupMulti-Node Cluster by following the below given article.http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ Below given is my setup: Hadoop : hadoop_0.20.2 Linux: Ubuntu Linux 10.10 Java: java-7-oracle I have successfully reached till the topic Starting the multi-node cluster in the above given article. 
When I start the HDFS/MapReduce daemons it is getting started and going down immediately both in master slave as well, please have a look at the below logs, hduser@kinigul2:/usr/local/hadoop$ bin/start-dfs.sh starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-kinigul2.out master: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-kinigul2.out slave: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-guruL.out master: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-kinigul2.out hduser@kinigul2:/usr/local/hadoop$ jps 6098 DataNode 6328 Jps 5914 NameNode 6276 SecondaryNameNode hduser@kinigul2:/usr/local/hadoop$ jps 6350 Jps I am getting below given error in slave logs: 2012-02-08 21:04:01,641 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Call to master/16.150.98.62:54310 failed on local exception: java.io.IOException: Connection reset by peer at org.apache.hadoop.ipc.Client.wrapException(Client.java:775) at org.apache.hadoop.ipc.Client.call(Client.java:743) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy4.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:346) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:383) at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:314) at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:291) at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:269) at
is 1.0.0 stable?
Hi everyone, I would imagine that 1.0.0 is stable, but the stable link still takes one to the 0.20.203 release. Is 1.0.0 ready for production usage? If not what about 0.20.205? thanks, stan
Re: reference document which properties are set in which configuration file
As a thumb rule, all properties starting with mapred.* or mapreduce.* go to mapred-site.xml, all properties starting with dfs.* go to hdfs-site.xml, and the rest may be put in core-site.xml to be safe. In case you notice MR or HDFS specific properties being outside of this naming convention, please do report a JIRA so we can deprecate the old name and rename it with a more appropriate prefix. On Sat, Feb 11, 2012 at 9:27 AM, Praveen Sripati praveensrip...@gmail.com wrote: The mapred.task.tracker.http.address will go in the mapred-site.xml file. In the Hadoop installation directory check the core-default.xml, hdfs-default,xml and mapred-default.xml files to know about the different properties. Some of the properties which might be in the code may not be mentioned in the xml files and will be defaulted. Praveen On Tue, Feb 7, 2012 at 3:30 PM, Kleegrewe, Christian christian.kleegr...@siemens.com wrote: Dear all, while configuring our hadoop cluster I wonder whether there exists a reference document that contains information about which configuration property has to be specified in which properties file. Especially I do not know where the mapred.task.tracker.http.address has to be set. Is it in the mapre-site.xml or in the hdfs-site.xml? any hint will be appreciated thanks Christian 8-- Siemens AG Corporate Technology Corporate Research and Technologies CT T DE IT3 Otto-Hahn-Ring 6 81739 München, Deutschland Tel.: +49 89 636-42722 Fax: +49 89 636-41423 mailto:christian.kleegr...@siemens.com Siemens Aktiengesellschaft: Vorsitzender des Aufsichtsrats: Gerhard Cromme; Vorstand: Peter Löscher, Vorsitzender; Roland Busch, Brigitte Ederer, Klaus Helmrich, Joe Kaeser, Barbara Kux, Hermann Requardt, Siegfried Russwurm, Peter Y. Solmssen, Michael Süß; Sitz der Gesellschaft: Berlin und München, Deutschland; Registergericht: Berlin Charlottenburg, HRB 12300, München, HRB 6684; WEEE-Reg.-Nr. DE 23691322 -- Harsh J Customer Ops. Engineer Cloudera | http://tiny.cloudera.com/about
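One way to check where a property ended up, with the caveat that this only reflects the client-side view, is to build a JobConf (which layers core-default/core-site and mapred-default/mapred-site from the classpath; the hdfs-*.xml resources are loaded by the HDFS daemons and classes themselves) and dump the merged result. The class name below is made up for illustration.

import org.apache.hadoop.mapred.JobConf;

public class DumpEffectiveConf {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf();  // merges the *-default.xml and *-site.xml resources on the classpath
    // Spot-check the property from the original question:
    System.out.println("mapred.task.tracker.http.address = "
        + conf.get("mapred.task.tracker.http.address"));
    // Or dump every effective property as XML:
    conf.writeXml(System.out);
  }
}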
Re: reference document which properties are set in which configuration file
Harsh, All This was one of the first questions that I asked. It is sometimes not clear whether some parameters are site related or jab related or whether it belongs to NN, JT , DN or TT. If I get some time during the weekend , I will try and put this into a document and see if it helps Raj From: Harsh J ha...@cloudera.com To: common-user@hadoop.apache.org Sent: Friday, February 10, 2012 8:31 PM Subject: Re: reference document which properties are set in which configuration file As a thumb rule, all properties starting with mapred.* or mapreduce.* go to mapred-site.xml, all properties starting with dfs.* go to hdfs-site.xml, and the rest may be put in core-site.xml to be safe. In case you notice MR or HDFS specific properties being outside of this naming convention, please do report a JIRA so we can deprecate the old name and rename it with a more appropriate prefix. On Sat, Feb 11, 2012 at 9:27 AM, Praveen Sripati praveensrip...@gmail.com wrote: The mapred.task.tracker.http.address will go in the mapred-site.xml file. In the Hadoop installation directory check the core-default.xml, hdfs-default,xml and mapred-default.xml files to know about the different properties. Some of the properties which might be in the code may not be mentioned in the xml files and will be defaulted. Praveen On Tue, Feb 7, 2012 at 3:30 PM, Kleegrewe, Christian christian.kleegr...@siemens.com wrote: Dear all, while configuring our hadoop cluster I wonder whether there exists a reference document that contains information about which configuration property has to be specified in which properties file. Especially I do not know where the mapred.task.tracker.http.address has to be set. Is it in the mapre-site.xml or in the hdfs-site.xml? any hint will be appreciated thanks Christian 8-- Siemens AG Corporate Technology Corporate Research and Technologies CT T DE IT3 Otto-Hahn-Ring 6 81739 München, Deutschland Tel.: +49 89 636-42722 Fax: +49 89 636-41423 mailto:christian.kleegr...@siemens.com Siemens Aktiengesellschaft: Vorsitzender des Aufsichtsrats: Gerhard Cromme; Vorstand: Peter Löscher, Vorsitzender; Roland Busch, Brigitte Ederer, Klaus Helmrich, Joe Kaeser, Barbara Kux, Hermann Requardt, Siegfried Russwurm, Peter Y. Solmssen, Michael Süß; Sitz der Gesellschaft: Berlin und München, Deutschland; Registergericht: Berlin Charlottenburg, HRB 12300, München, HRB 6684; WEEE-Reg.-Nr. DE 23691322 -- Harsh J Customer Ops. Engineer Cloudera | http://tiny.cloudera.com/about