Re: Streaming data - Available tools
Storm is another project sponsored by the ASF. Look here: http://storm.apache.org

On 04/07/14 12:28, Adaryl Bob Wakefield, MBA wrote:
Storm. It's not a part of the Apache project, but it seems to be what people are using to process event data. B.

From: santosh.viswanat...@accenture.com
Sent: Friday, July 04, 2014 11:25 AM
To: user@hadoop.apache.org
Subject: Streaming data - Available tools

Hello Experts, I wanted to explore the tools available in the market for streaming data. I know Apache Spark exists. Are there any other tools available? Regards, Santosh Karthikeyan

-- Marcos Ortiz (@marcosluis2186) - http://www.linkedin.com/in/mlortiz - http://about.me/marcosortiz
Re: Storing videos in HDFS
What do you want to achieve with this? I've seen Hadoop being used for video analytics, storing just the videos' metadata, the number of unique views, and that kind of thing, but I've never seen this use case. A good example of this is Ooyala, which has been using Hadoop + Apache Cassandra for this, although they migrated to a Spark/Shark + Cassandra solution. They wrote a whitepaper called "Designing a Scalable Database for Online Video Analytics", and Evan Chan (@evanfchan) gave a great talk at Cassandra Summit 2013 about how to use Spark/Shark + Cassandra for real-time video analytics.

-- Marcos Ortiz (@marcosluis2186) - http://www.linkedin.com/in/mlortiz - http://about.me/marcosortiz

On Tuesday, June 17, 2014 06:12:49 PM alajangikish...@gmail.com wrote:
Hi hadoopers, what is the best way to store video files in HDFS?
Sent from my iPhone
Re: MapReduce scalability study
On Thursday, May 22, 2014 10:17:42 PM Sylvain Gault wrote:
Hello, I'm new to this mailing list, so forgive me if I don't do everything right. I didn't know whether I should ask on this mailing list or on mapreduce-dev or on yarn-dev, so I'll just start here. ^^

Short story: I'm looking for some paper(s) studying the scalability of Hadoop MapReduce, and I found this extremely difficult to find on Google Scholar. Do you have something worth citing in a PhD thesis?

Long story: I'm writing my PhD thesis about MapReduce, and when I talk about Hadoop I'd like to say how much it scales. I heard two years ago that Yahoo! got it to scale up to 4000 nodes and planned to try 6000 nodes, or something like that. I also heard that YARN/MRv2 should scale better, but I don't plan to talk much about YARN/MRv2. So I'd take anything I could cite as a reference in my manuscript. :)

Best regards, Sylvain Gault

Hello, Sylvain. One of the reasons why the Hadoop dev team began to work on YARN was precisely to get a more scalable and resourceful Hadoop system, so if you actually want to talk about Hadoop scalability, you should talk about YARN and MR2. The paper is here: https://developer.yahoo.com/blogs/hadoop/next-generation-apache-hadoop-mapreduce-3061.html and the related JIRA issues are here:
https://issues.apache.org/jira/browse/MAPREDUCE-278
https://issues.apache.org/jira/browse/MAPREDUCE-279

You should talk with Arun C Murthy, Chief Architect at Hortonworks, about all these topics. He could help you much more than I could.

-- Marcos Ortiz (@marcosluis2186) - http://www.linkedin.com/in/mlortiz - http://about.me/marcosortiz
Re: Job Tracker Stops as Task Tracker starts
What version of the JDK are you using on your servers? What version of Hadoop are you using?

-- Marcos Ortiz (@marcosluis2186) - http://www.linkedin.com/in/mlortiz - http://about.me/marcosortiz

On Tuesday, May 20, 2014 09:01:07 PM Faisal Rabbani wrote:
Hi, I just installed the jobtracker and tasktrackers, but as soon as I start any of my tasktrackers, the jobtracker's homepage gives the following error:

java.lang.NoSuchMethodError: sun.misc.FloatingDecimal.digitsRoundedUp()Z
at java.text.DigitList.set(DigitList.java:292)
at java.text.DecimalFormat.format(DecimalFormat.java:599)
at java.text.DecimalFormat.format(DecimalFormat.java:522)
at java.text.NumberFormat.format(NumberFormat.java:271)
at org.apache.hadoop.mapred.jobtracker_jsp.generateSummaryTable(jobtracker_jsp.java:26)
at org.apache.hadoop.mapred.jobtracker_jsp._jspService(jobtracker_jsp.java:146)
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:98)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1069)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

whereas on jobtracker:50030/machines.jsp?type=active (hmaster01, http://162.243.238.97:50030/jobtracker.jsp) all tasktrackers are shown in the running state. Active Task Trackers:
tracker_hslav01:localhost/127.0.0.1:39451 (http://hslav01:50060/, hslav01) - 0 running tasks, max map tasks 2, max reduce tasks 2, 0 failures, node health N/A
tracker_hslave02:localhost/127.0.0.1:56916 (http://hslave02:50060/, hslave02) - 0 running tasks, max map tasks 2, max reduce tasks 2, 0 failures, node health N/A
tracker_hslave04:localhost/127.0.0.1:43590 (http://hslave04:50060/, hslave04) - 0 running tasks, max map tasks 2, max reduce tasks 2, 0 failures, node health N/A
tracker_hslave03:localhost/127.0.0.1:56552 (http://hslave03:50060/, hslave03) - 0 running tasks, max map tasks 2, max reduce tasks 2, 0 failures, node health N/A
(Hadoop, http://hadoop.apache.org/core, 2014.)

Any suggestions please. -- Thanks, Faisal Ali Rabbani
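A NoSuchMethodError on a JDK-internal class like sun.misc.FloatingDecimal typically means bytecode compiled against one JDK build is running on a different one, which is presumably why the JDK version question matters here. A quick, hedged sanity check on each node (paths depend on your installation):

java -version
echo $JAVA_HOME
$JAVA_HOME/bin/java -version    # the JVM that hadoop-env.sh actually points at
hadoop version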
Re: Random Exception
It seems that your Hadoop data directory is broken or your disk has problems. Which version of Hadoop are you using?

On Friday, May 02, 2014 08:43:44 AM S.L wrote:
Hi All, I get this exception after I resubmit my failed MapReduce job. Can someone please let me know what this exception means?

14/05/02 01:28:25 INFO mapreduce.Job: Task Id : attempt_1398989569957_0021_m_00_0, Status : FAILED
Error: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1398989569957_0021_m_00_0/intermediate.26
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:402)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:711)
at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:579)
at org.apache.hadoop.mapred.Merger.merge(Merger.java:150)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1870)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1482)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:437)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Unknown Source)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
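The exception comes from LocalDirAllocator failing to find a writable local directory for intermediate map output. A hedged way to check this (the property names are the standard MR1 and YARN ones; the paths shown are only examples and must be replaced with the directories configured on the failing node):

grep -A1 "mapred.local.dir" $HADOOP_CONF_DIR/mapred-site.xml
grep -A1 "yarn.nodemanager.local-dirs" $HADOOP_CONF_DIR/yarn-site.xml
# then verify the listed directories exist, are writable by the task user, and have free space
df -h /data/1/mapred/local
ls -ld /data/1/mapred/local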
Re: Which database should be used
On Friday, May 02, 2014 04:21:58 PM Alex Lee wrote:
There are many databases, such as HBase, Hive, MongoDB, etc. I need to choose one to store a big volume of streaming data from sensors. Will HBase be good? Thanks.

HBase could be a good ally for this use case. You should check the OpenTSDB project, which addresses a very similar problem: http://opentsdb.net/ You should also check the HBaseCon presentations and videos to see what you could use for your case.
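To make the HBase suggestion concrete, here is a minimal sketch (not from the original thread) of writing one sensor reading with the HBase 0.94-era Java client. The table name, column family, and the OpenTSDB-style row key layout are assumptions for illustration only; keying by sensor id plus timestamp keeps each sensor's readings in contiguous rows, so time-range reads become sequential scans.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SensorWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "sensor_data");   // hypothetical table name
        long ts = System.currentTimeMillis();
        // OpenTSDB-style key: sensor id + timestamp
        byte[] rowKey = Bytes.add(Bytes.toBytes("sensor-42:"), Bytes.toBytes(ts));
        Put put = new Put(rowKey);
        put.add(Bytes.toBytes("d"), Bytes.toBytes("temp"), Bytes.toBytes("21.5"));
        table.put(put);
        table.close();
    }
}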
Re: upgrade to CDH5 from CDH4.6 hadoop 2.0
Regards, Motty. This kind of question should be asked on the CDH Users mailing list; there you will get a better and faster answer. Best wishes.

On Monday, April 28, 2014 01:00:13 PM motty cruz wrote:
Hello, I'm upgrading to CDH5. I downloaded the latest parcel from http://archive.cloudera.com/cdh5/parcels/latest/ to /opt/cloudera/parcel-repo. Next, on the cluster, in Cloudera Manager under Parcels, I hit the distribution button. It started to distribute and got to 50%, but it does not go any further. Any ideas on how to proceed? Thanks.
Re: Intel Hadoop Distribution.
Regards, Chengi. Intel is working on a battle-tested Hadoop distribution, with a marked focus on security enhancements. You can see it here: https://github.com/intel-hadoop/project-rhino/ Best wishes.

On 03/01/2013 04:47 PM, Chengi Liu wrote:
Hi, I am curious. At this Strata, Intel made an announcement of their own Hadoop distribution, optimized for their chips and with some of their own implementations. I was a bit surprised to see Intel's involvement in the Hadoop world, but now it somehow makes sense (it's a big market after all): http://www.javaworld.com/javaworld/jw-02-2013/130227-intel-releases-hadoop-distribution.html I was wondering how their distribution is different from the other players'. Why would anyone buy Intel's distribution at all? (This is probably not suited for this mailing list; if so, please let me know.) Thanks.

-- Marcos Ortiz Valmaseda, Product Manager Data Scientist at UCI - Blog: http://marcosluis2186.posterous.com - Twitter: @marcosluis2186 (http://twitter.com/marcosluis2186)
Re: How to handle sensitive data
Regards, abhishek. I agree with Michael. You can encrypt your incoming data in your application before it reaches Hadoop. I recommend using HBase too.

- Original message -
From: Michael Segel michael_se...@hotmail.com
To: common-user@hadoop.apache.org
CC: cdh-u...@cloudera.org
Sent: Friday, February 15, 2013 8:47:16
Subject: Re: How to handle sensitive data

Simple: have your app encrypt the field prior to writing to HDFS. Also consider HBase.

On Feb 14, 2013, at 10:35 AM, abhishek abhishek.dod...@gmail.com wrote:
Hi all, we have some sensitive data in particular fields (columns). Can you tell me how to handle sensitive data in Hadoop? How do different people handle sensitive data in Hadoop? Thanks, Abhi

Michael Segel | (m) 312.755.9623 | Segel and Associates

-- Marcos Ortiz Valmaseda, Product Manager Data Scientist at UCI - Blog: http://marcosluis2186.posterous.com - LinkedIn: http://www.linkedin.com/in/marcosluis2186 - Twitter: @marcosluis2186
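As a minimal sketch of the "encrypt the field before writing" advice (this is not code from the thread; the key handling and cipher mode are illustrative assumptions, and in production you would want managed keys and an authenticated mode such as GCM rather than the JDK default ECB):

import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;

public class FieldEncryptor {
    private final SecretKeySpec key;

    public FieldEncryptor(byte[] rawKey) {              // 16-byte key => AES-128
        this.key = new SecretKeySpec(rawKey, "AES");
    }

    // Encrypt a single sensitive column value; write the returned bytes to HDFS or HBase
    public byte[] encrypt(String fieldValue) throws Exception {
        Cipher cipher = Cipher.getInstance("AES");      // JDK default: AES/ECB/PKCS5Padding (illustration only)
        cipher.init(Cipher.ENCRYPT_MODE, key);
        return cipher.doFinal(fieldValue.getBytes(StandardCharsets.UTF_8));
    }
}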
Re: .deflate trouble
Yes, I know, Keith. I know that you want more control over your Hadoop cluster, so I recommend a few things:
- You can use Whirr to manage your Hadoop cluster installations on EC2 [1]
- You can create your own Hadoop-focused AMI based on your requirements (my favorite choice here)
- Or simply install Hadoop on EC2 with Puppet or Chef to have better control over your configuration and management.
- Or, if you have the budget for it, you can choose the MapR M3 or M5 distribution in the Amazon Marketplace. [2][3][4]

[1] http://whirr.apache.org
[2] https://aws.amazon.com/marketplace/pp/B008B7VT2C
[3] https://aws.amazon.com/marketplace/pp/B008B7WAAW/ref=sp_mpg_product_title?ie=UTF8sr=0-2
[4] http://aws.amazon.com/es/elasticmapreduce/mapr/

- Original message -
From: Keith Wiley kwi...@keithwiley.com
To: user@hadoop.apache.org
Sent: Friday, February 15, 2013 12:36:20
Subject: Re: .deflate trouble

I might contact them, but we are specifically avoiding EMR for this project. We have already successfully deployed EMR, but we want more precise control over the cluster, namely the ability to persist and reawaken it on demand. We really want a direct Hadoop installation instead of an EMR-based installation. But I might contact them anyway to see what they recommend. Thanks for the refs.

On Feb 14, 2013, at 19:09, Marcos Ortiz Valmaseda wrote:
Regards, Keith. For EMR issues and stuff, you can contact Jeff Barr (Chief Evangelist for AWS) or Saurabh Baji (Product Manager for AWS EMR) directly. Best wishes.

From: Keith Wiley kwi...@keithwiley.com
To: user@hadoop.apache.org
Sent: Thursday, February 14, 2013 15:46:05
Subject: Re: .deflate trouble

Good call. We can't use the conventional web-based JT due to corporate access issues, but I looked at the job_XXX.xml file directly, and sure enough, it set mapred.output.compress to true. Now I just need to remember how that occurred. I simply ran the wordcount example straight off the command line; I didn't specify any overridden conf settings for the job. Ultimately, the solution (or part of it) is to get away from 0.19 to a more up-to-date version of Hadoop. I would prefer 2.0 over 1.0 in fact, but due to a remarkable lack of concise EC2/Hadoop documentation (and the fact that what docs I did find were very old and therefore conformed to 0.19-style Hadoop), I have fallen back on old versions of Hadoop for my initial tests. In the long run, I will need a more modern version of Hadoop to successfully deploy on EC2. Thanks.

On Feb 14, 2013, at 15:02, Harsh J wrote:
Did the job.xml of the job that produced this output also carry mapred.output.compress=false in it? The file should be viewable on the JT UI page for the job. Unless explicitly turned on, even 0.19 wouldn't have enabled compression on its own.

Keith Wiley - kwi...@keithwiley.com - keithwiley.com - music.keithwiley.com
"What I primarily learned in grad school is how much I *don't* know. Consequently, I left grad school with a higher ignorance to knowledge ratio than when I entered." -- Keith Wiley

-- Marcos Ortiz Valmaseda, Product Manager Data Scientist at UCI - Blog: http://marcosluis2186.posterous.com - LinkedIn: http://www.linkedin.com/in/marcosluis2186 - Twitter: @marcosluis2186
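For the Whirr option, a minimal recipe looks roughly like the following (a sketch based on Whirr's standard properties; the cluster name, instance counts, and hardware id are illustrative assumptions, and the AWS keys are read from the environment):

whirr.cluster-name=hadoop-test
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,3 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.hardware-id=m1.large

The cluster is then launched and torn down with:

whirr launch-cluster --config hadoop-ec2.properties
whirr destroy-cluster --config hadoop-ec2.properties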
Re: Hadoop 2.0.3 namenode issue
Regards, Dheeren. It seems that you are using a version of HDFS that is incompatible with this version of HBase. Can you provide the exact version of your HBase package?

- Original message -
From: Dheeren Bebortha dbebor...@salesforce.com
To: user@hadoop.apache.org
Sent: Friday, February 15, 2013 13:28:19
Subject: Hadoop 2.0.3 namenode issue

Hi, in one of our test clusters, which has Namenode HA using QJM + YARN + HBase 0.94, the namenode came down with the following logs. I am trying to root-cause the issue. Any help is appreciated.

2013-02-13 10:18:27,521 INFO hdfs.StateChange - BLOCK* NameSystem.fsync: file /hbase/.logs/datanode-X.sfdomain.com,60020,1360091866476/datanode-X.sfdomain.com%2C60020%2C1360091866476.1360750706694 for DFSClient_hb_rs_datanode-X.sfdomain.com,60020,1360091866476_470800334_38
2013-02-13 10:20:01,861 WARN ipc.Server - Incorrect header or version mismatch from 10.232.29.4:49933 got version 4 expected version 7
2013-02-13 10:20:01,884 WARN ipc.Server - Incorrect header or version mismatch from 10.232.29.4:49935 got version 4 expected version 7
2013-02-13 10:20:02,550 WARN ipc.Server - Incorrect header or version mismatch from 10.232.29.4:49938 got version 4 expected version 7
2013-02-13 10:20:08,210 INFO namenode.FSNamesystem - Roll Edit Log from 10.232.29.14

2013-02-13 12:14:32,879 INFO namenode.FileJournalManager - Finalizing edits file /data/hdfs/current/edits_inprogress_0065699 - /data/hdfs/current/edits_0065699-0065700
2013-02-13 12:14:32,879 INFO namenode.FSEditLog - Starting log segment at 65701
2013-02-13 12:15:02,507 INFO namenode.NameNode - FSCK started by sfdc (auth:SIMPLE) from /10.232.29.4 for path / at Wed Feb 13 12:15:02 GMT+00:00 2013
2013-02-13 12:15:02,663 WARN ipc.Server - Incorrect header or version mismatch from 10.232.29.4:40025 got version 4 expected version 7
2013-02-13 12:15:02,663 WARN ipc.Server - Incorrect header or version mismatch from 10.232.29.4:40027 got version 4 expected version 7
2013-02-13 12:15:03,391 WARN ipc.Server - Incorrect header or version mismatch from 10.232.29.4:40031 got version 4 expected version 7
2013-02-13 12:16:33,181 INFO namenode.FSNamesystem - Roll Edit Log from 10.232.29.14

-- Marcos Ortiz Valmaseda, Product Manager Data Scientist at UCI - Blog: http://marcosluis2186.posterous.com - LinkedIn: http://www.linkedin.com/in/marcosluis2186 - Twitter: @marcosluis2186
Re: .deflate trouble
Regards, Keith. For EMR issues and stuff, you can contact Jeff Barr (Chief Evangelist for AWS) or Saurabh Baji (Product Manager for AWS EMR) directly. Best wishes.

- Original message -
From: Keith Wiley kwi...@keithwiley.com
To: user@hadoop.apache.org
Sent: Thursday, February 14, 2013 15:46:05
Subject: Re: .deflate trouble

Good call. We can't use the conventional web-based JT due to corporate access issues, but I looked at the job_XXX.xml file directly, and sure enough, it set mapred.output.compress to true. Now I just need to remember how that occurred. I simply ran the wordcount example straight off the command line; I didn't specify any overridden conf settings for the job. Ultimately, the solution (or part of it) is to get away from 0.19 to a more up-to-date version of Hadoop. I would prefer 2.0 over 1.0 in fact, but due to a remarkable lack of concise EC2/Hadoop documentation (and the fact that what docs I did find were very old and therefore conformed to 0.19-style Hadoop), I have fallen back on old versions of Hadoop for my initial tests. In the long run, I will need a more modern version of Hadoop to successfully deploy on EC2. Thanks.

On Feb 14, 2013, at 15:02, Harsh J wrote:
Did the job.xml of the job that produced this output also carry mapred.output.compress=false in it? The file should be viewable on the JT UI page for the job. Unless explicitly turned on, even 0.19 wouldn't have enabled compression on its own.

Keith Wiley - kwi...@keithwiley.com - keithwiley.com - music.keithwiley.com
"The easy confidence with which I know another man's religion is folly teaches me to suspect that my own is also." -- Mark Twain

-- Marcos Ortiz Valmaseda, Product Manager Data Scientist at UCI - Blog: http://marcosluis2186.posterous.com - LinkedIn: http://www.linkedin.com/in/marcosluis2186 - Twitter: @marcosluis2186
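Two hedged workarounds while still on the old cluster (the jar name and paths are illustrative): since the examples accept generic options, compression can be disabled per job on the command line, and an existing .deflate part file can still be read because hadoop fs -text decompresses known codecs.

hadoop jar hadoop-examples.jar wordcount -D mapred.output.compress=false input output
hadoop fs -text output/part-00000 | head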
Re: Hadoop Tutorial help
Hi, Jennifer. Robert Evans, from the Yahoo! team, was working on updating that tutorial to use at least the Hadoop 1.x series, but I don't know the current progress of that project. For now, you don't need to download Hadoop 0.18.0, because it is included in the VMware Hadoop VM. You can download the VM and try it. Best wishes.

- Original message -
From: Jennifer Lopez lopez.miri...@gmail.com
To: user@hadoop.apache.org
Sent: Sunday, December 9, 2012 10:53:55
Subject: Hadoop Tutorial help

I am going through the tutorial presented at http://developer.yahoo.com/hadoop/tutorial/module3.html#vm-jobs I have installed VMware and the Hadoop virtual machine. This tutorial talks about the Hadoop 0.18.0 version and states that the compilation would be done on the Windows host machine. I want to try out simple examples, and now I see that the Hadoop 0.18.0 version is not available on the Apache Hadoop website. How do I go ahead now? Are any other Hadoop virtual machines available for such tutorials? Any info would be highly appreciated. -- Lopez
Re: Strange machine behavior
Are you sure that 24 map slots is a good number for this machine? Remember that you have three services (DataNode, TaskTracker, and HRegionServer) with a combined 12 GB of heap. Try a lower number of map slots (12, for example) and launch your MR job again. Can you share your logs on pastebin?

On Sat 08 Dec 2012 07:09:02 PM CST, Robert Dyer wrote:
Has anyone experienced a TaskTracker/DataNode behaving like the attached image? This was during an MR job (which runs often). Note the extremely high system CPU time. Upon investigating, I saw that out of 64 GB of RAM the system had allocated almost 45 GB to cache! I did a sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches; sync", which is roughly where the graph goes back to normal (much lower system, much higher user). This has happened a few times. I have tried playing with the sysctl vm.swappiness value (default of 60) by setting it to 30 (which it was at when the graph was collected) and now to 10. I am not sure that helps. Any ideas? Has anyone else run into this before?

24 cores, 64 GB RAM, 4 x 2 TB SATA3 HDD. Running Hadoop 1.0.4, with a DataNode (2 GB heap) and TaskTracker (2 GB heap) on this machine; 24 map slots (1 GB heap each), no reducers. Also running HBase 0.94.2 with a RegionServer (8 GB) on this machine.

-- Marcos Luis Ortiz Valmaseda - about.me/marcosortiz - @marcosluis2186
Re: HADOOP UPGRADE ERROR
On 11/22/2012 08:55 PM, yogesh dhari wrote:
Hi All, I am trying to upgrade hadoop-0.20.2 to hadoop-1.0.4. I used the command hadoop namenode -upgrade. After that, if I start the cluster with start-all.sh, the TT and DN don't start.

Which steps did you follow to perform the upgrade? In Tom White's book "Hadoop: The Definitive Guide", Chapter 10, there is a great section dedicated to upgrades, where he describes the basic procedure:
1. Make sure that any previous upgrade is finalized before proceeding with another upgrade.
2. Shut down MapReduce, and kill any orphaned task processes on the tasktrackers.
3. Shut down HDFS, and back up the namenode directories.
4. Install new versions of Hadoop HDFS and MapReduce on the cluster and on clients.
5. Start HDFS with the -upgrade option: $NEW_HADOOP_INSTALL/bin/start-dfs.sh -upgrade
6. Wait until the upgrade is complete: $NEW_HADOOP_INSTALL/bin/hadoop dfsadmin -upgradeProgress status
7. Perform some sanity checks on HDFS.
8. Start MapReduce.
9. Roll back or finalize the upgrade (optional): $NEW_HADOOP_INSTALL/bin/hadoop dfsadmin -finalizeUpgrade and $NEW_HADOOP_INSTALL/bin/hadoop dfsadmin -upgradeProgress status

1) Log file of the TT:
2012-11-23 07:15:54,399 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:yogesh cause:java.io.IOException: Call to localhost/127.0.0.1:9001 failed on local exception: java.io.IOException: Connection reset by peer
2012-11-23 07:15:54,400 ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because java.io.IOException: Call to localhost/127.0.0.1:9001 failed on local exception: java.io.IOException: Connection reset by peer
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1107)
at org.apache.hadoop.ipc.Client.call(Client.java:1075)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)

Which mode do you have enabled in your Hadoop cluster?

2) Log file of the DN:
2012-11-23 07:07:57,095 INFO org.apache.hadoop.hdfs.server.common.Storage: Cannot access storage directory /opt/hadoop_newdata_dirr
2012-11-23 07:07:57,096 INFO org.apache.hadoop.hdfs.server.common.Storage: Storage directory /opt/hadoop_newdata_dirr does not exist.
2012-11-23 07:07:57,199 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: All specified directories are not accessible or do not exist.
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:139)
at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:385)

Although /opt/hadoop_new_dirr exists with file permissions 755.

Does the user yogesh have all privileges on that directory? Also note that /opt/hadoop_new_dirr is not the same as /opt/hadoop_newdata_dirr.

Please suggest. Thanks and regards, Yogesh Kumar

-- Marcos Luis Ortiz Valmaseda - about.me/marcosortiz - @marcosluis2186
Re: hadoop - running examples
Mohammad is right. When you write a file to HDFS, it can't be modified. The pattern in HDFS is write-once/read-many-times. If you want a distribution where you can read and write files in place, you should take a look at the MapR distribution.

- Original message -
From: Mohammad Tariq donta...@gmail.com
To: user@hadoop.apache.org
Sent: Thu, 08 Nov 2012 17:33:56 -0500 (CST)
Subject: Re: hadoop - running examples

Apologies for the wrong word. Yes, I meant non-modifiable. Regards, Mohammad Tariq

On Fri, Nov 9, 2012 at 4:01 AM, Jay Vyas jayunit...@gmail.com wrote:
What do you mean by immutable? Do you mean non-modifiable, maybe? Immutable implies that they can't be deleted. Jay Vyas, MMSB UCHC

On Nov 8, 2012, at 5:28 PM, Mohammad Tariq donta...@gmail.com wrote:
Files are immutable once written into HDFS. And touchz creates a file of 0 length. Regards, Mohammad Tariq

On Fri, Nov 9, 2012 at 3:18 AM, Kartashov, Andy andy.kartas...@mpac.ca wrote:
Guys, when running the examples, you bring them into HDFS. Say you need to make some correction to a file: you make it on the local FS and run $ hadoop fs -put ... again. You cannot just make changes to files inside HDFS, except for touchz-ing a file, correct? Just making sure. Thanks, AK
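A minimal sketch of the "edit locally, then replace" workflow described above (the paths are illustrative; the -f overwrite flag is only available on newer releases):

# edit data.txt on the local filesystem, then replace the copy in HDFS
hadoop fs -rm /user/andy/input/data.txt
hadoop fs -put data.txt /user/andy/input/
# on newer releases the two steps can be collapsed into one
hadoop fs -put -f data.txt /user/andy/input/data.txt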
Re: monitoring CPU cores (resource consumption) in hadoop
Regards, Jim. In the open source world, I don't know of one. In the enterprise world, Boundary is a great choice. Look here: http://boundary.com/why-boundary/product/

On 11/03/2012 02:59 PM, ugiwgh wrote:
Paramon can solve this problem. It can monitor CPU cores. --GHui

-- Original --
From: Jim Neofotistos jim.neofotis...@oracle.com
Date: Sun, Nov 4, 2012 03:00 AM
To: user user@hadoop.apache.org
Subject: monitoring CPU cores (resource consumption) in hadoop

The standard Hadoop monitoring metrics system doesn't allow monitoring of CPU cores. Ganglia, the open source monitoring tool, does not have that capability with the RRD tool either. top is an option, but I was looking for something cluster-wide. Jim

-- Marcos Luis Ortiz Valmaseda - about.me/marcosortiz - @marcosluis2186
Re: Set the number of maps
The option was renamed in 0.21 to mapreduce.tasktracker.map.tasks.maximum, and, as Harsh said, it is a TaskTracker service-level option. Another thing is that this option is closely tied to mapred.child.java.opts, so make sure to constantly monitor the effect of these changes on your cluster.

On 11/01/2012 11:55 AM, Harsh J wrote:
It can't be set from the code this way - the slot property is applied at the TaskTracker service level (as the name suggests). Since you're just testing at the moment, try to set these values, restart the TTs, and run your jobs again. You do not need to restart the JT at any point when tweaking these values.

On Thu, Nov 1, 2012 at 7:13 PM, Cogan, Peter (Peter) peter.co...@alcatel-lucent.com wrote:
Hi, I understand that the maximum number of concurrent map tasks is set by mapred.tasktracker.map.tasks.maximum - however, I wish to run with a smaller number of maps (I am testing disk IO). I thought that I could set that within the main program using conf.set("mapred.tasktracker.map.tasks.maximum", "4"); to run with 4 maps - but that seems to have no impact. I know I could just change mapred-site.xml and restart MapReduce, but that's kind of a pain. Can it be set from within the code? Thanks, Peter

-- Marcos Luis Ortiz Valmaseda - about.me/marcosortiz - @marcosluis2186
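For reference, a hedged sketch of the TaskTracker-level setting (the values are illustrative; this goes in mapred-site.xml on each TaskTracker and takes effect after a TT restart):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>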
Re: File Permissions on s3 FileSystem
On 23/10/12 13:32, Parth Savani wrote:
Hello everyone, I am trying to run a Hadoop job with s3n as my filesystem. I changed the following property in my hdfs-site.xml:
fs.default.name=s3n://KEY:VALUE@bucket/

A good practice here is to put these two properties in core-site.xml if you will use S3 often:

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>AWS_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>AWS_SECRET_ACCESS_KEY</value>
</property>

After that, you can access your URIs in a friendlier way:
S3: s3://s3-bucket/s3-filepath
S3n: s3n://s3-bucket/s3-filepath

mapreduce.jobtracker.staging.root.dir=s3n://KEY:VALUE@bucket/tmp

When I run the job from EC2, I get the following error:
The ownership on the staging directory s3n://KEY:VALUE@bucket/tmp/ec2-user/.staging is not as expected. It is owned by . The directory must be owned by the submitter ec2-user or by ec2-user
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:113)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:844)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:844)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:481)

I am using the Cloudera CDH4 Hadoop distribution. The error is thrown from the JobSubmissionFiles.java class:

public static Path getStagingDir(JobClient client, Configuration conf)
    throws IOException, InterruptedException {
  Path stagingArea = client.getStagingAreaDir();
  FileSystem fs = stagingArea.getFileSystem(conf);
  String realUser;
  String currentUser;
  UserGroupInformation ugi = UserGroupInformation.getLoginUser();
  realUser = ugi.getShortUserName();
  currentUser = UserGroupInformation.getCurrentUser().getShortUserName();
  if (fs.exists(stagingArea)) {
    FileStatus fsStatus = fs.getFileStatus(stagingArea);
    String owner = fsStatus.getOwner();
    if (!(owner.equals(currentUser) || owner.equals(realUser))) {
      throw new IOException("The ownership on the staging directory " +
          stagingArea + " is not as expected. " +
          "It is owned by " + owner + ". The directory must " +
          "be owned by the submitter " + currentUser + " or " +
          "by " + realUser);
    }
    if (!fsStatus.getPermission().equals(JOB_DIR_PERMISSION)) {
      LOG.info("Permissions on staging directory " + stagingArea + " are " +
          "incorrect: " + fsStatus.getPermission() + ". Fixing permissions " +
          "to correct value " + JOB_DIR_PERMISSION);
      fs.setPermission(stagingArea, JOB_DIR_PERMISSION);
    }
  } else {
    fs.mkdirs(stagingArea, new FsPermission(JOB_DIR_PERMISSION));
  }
  return stagingArea;
}

I think my job calls getOwner(), which returns NULL since S3 does not have file permissions, and that results in the IOException I am getting.

With what user are you launching the job on EC2?

Any workaround for this? Any idea how I could use S3 as the filesystem with Hadoop in distributed mode?

Look here: http://wiki.apache.org/hadoop/AmazonS3
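Once the keys are in core-site.xml, a quick hedged sanity check from the command line (the bucket name and paths are illustrative):

hadoop fs -ls s3n://my-bucket/input/
hadoop distcp s3n://my-bucket/input hdfs://namenode:8020/user/ec2-user/input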
Re: Java heap space error
Regards, Subash. Can you share more information about your YARN cluster?

- Original message -
From: Subash D'Souza sdso...@truecar.com
To: user@hadoop.apache.org
Sent: Sun, 21 Oct 2012 09:18:43 -0400 (CDT)
Subject: Java heap space error

I'm running CDH4 on a 4-node cluster, each node with 96 GB of RAM. Up until last week the cluster was running fine, until there was an error in the namenode log file and I had to reformat it and put the data back. Now when I run Hive on YARN, I keep getting a Java heap space error. Based on the research I did, I upped mapred.child.java.opts first from 200m to 400m and then to 800m, and I still have the same issue. It seems to fail near the 100% mapper mark. I checked the log files, and the only thing they show is the Java heap space error, nothing more. Any help would be appreciated. Thanks, Subash
Re: hadoop 0.23.3 configurations
Regards, Visioner. Look here; it is a quick and useful guide for doing this: http://practicalcloudcomputing.com/post/26448910436/install-and-run-hadoop-yarn-in-10-easy-steps Best wishes.

On 11/10/2012 10:07, Visioner Sadak wrote:
Hi, I just installed 0.23.3 and it seems that the configurations are entirely different. Does anyone know how to configure JAVA_HOME? hadoop-env.sh and mapred-site.xml are also not present in the etc/hadoop/ folder.

-- Marcos Ortiz Valmaseda, http://about.me/marcosortiz - Twitter: @marcosluis2186
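A hedged sketch of the two missing pieces (the JDK path is illustrative, and the mapred-site.xml template file is only shipped in some releases):

# point the Hadoop scripts at your JDK, either in the shell environment...
export JAVA_HOME=/usr/lib/jvm/java-1.6.0
# ...or persisted in the config directory
echo 'export JAVA_HOME=/usr/lib/jvm/java-1.6.0' >> etc/hadoop/hadoop-env.sh
# mapred-site.xml can be created from the shipped template, if present, or written from scratch
cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml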
Re: issue with permissions of mapred.system.dir
On 10/09/2012 07:44 PM, Goldstone, Robin J. wrote:
I am bringing up a Hadoop cluster for the first time (but am an experienced sysadmin with lots of cluster experience) and am running into an issue with permissions on mapred.system.dir. It has generally been a chore to figure out all the various directories that need to be created to get Hadoop working, some on the local FS, others within HDFS, getting the right ownership and permissions, etc. I think I am mostly there, but I can't seem to get past my current issue with mapred.system.dir.

Some general info first:
OS: RHEL6
Hadoop version: hadoop-1.0.3-1.x86_64
20-node cluster configured as follows: 1 node as primary namenode, 1 node as secondary namenode + jobtracker, 18 nodes as datanode + tasktracker.

I have HDFS up and running and have the following in mapred-site.xml:

<property>
  <name>mapred.system.dir</name>
  <value>hdfs://hadoop1/mapred</value>
  <description>Shared data for JT - this must be in HDFS</description>
</property>

I have created this directory in HDFS, owner mapred:hadoop, permissions 700, which seems to be the most common recommendation amongst multiple, often conflicting articles about how to set up Hadoop. Here is the top level of my filesystem:

hyperion-hdp4@hdfs: hadoop fs -ls /
Found 3 items
drwx------ - mapred hadoop 0 2012-10-09 12:58 /mapred
drwxrwxrwx - hdfs hadoop 0 2012-10-09 13:00 /tmp
drwxr-xr-x - hdfs hadoop 0 2012-10-09 12:51 /user

Note, it doesn't seem to really matter what permissions I set on /mapred, since when the jobtracker starts up it changes them to 700. However, when I try to run the Hadoop example teragen program as a regular user, I get this error:

hyperion-hdp4@robing: hadoop jar /usr/share/hadoop/hadoop-examples*.jar teragen -D dfs.block.size=536870912 100 /user/robing/terasort-input
Generating 100 using 2 maps with step of 50
12/10/09 16:27:02 INFO mapred.JobClient: Running job: job_201210072045_0003
12/10/09 16:27:03 INFO mapred.JobClient: map 0% reduce 0%
12/10/09 16:27:03 INFO mapred.JobClient: Job complete: job_201210072045_0003
12/10/09 16:27:03 INFO mapred.JobClient: Counters: 0
12/10/09 16:27:03 INFO mapred.JobClient: Job Failed: Job initialization failed:
org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=robing, access=EXECUTE, inode=mapred:mapred:hadoop:rwx------
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:57)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.init(DFSClient.java:3251)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:713)
at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:182)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:555)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:536)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:443)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:435)
at org.apache.hadoop.security.Credentials.writeTokenStorageFile(Credentials.java:169)
at org.apache.hadoop.mapred.JobInProgress.generateAndStoreTokens(JobInProgress.java:3537)
at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:696)
at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:4207)
at org.apache.hadoop.mapred.FairScheduler$JobInitializer$InitJob.run(FairScheduler.java:291)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
(rest of stack trace omitted)

This seems to be saying that it is trying to write to the HDFS /mapred filesystem as me (robing) rather than as mapred, the username under which the jobtracker and tasktracker run. To verify this is what is happening, I manually changed the permissions on /mapred from 700 to 755, since it claims to want execute access:

hyperion-hdp4@mapred: hadoop fs -chmod 755 /mapred
hyperion-hdp4@mapred: hadoop fs -ls /
Found 3 items
drwxr-xr-x - mapred hadoop 0 2012-10-09 12:58 /mapred
drwxrwxrwx - hdfs hadoop 0 2012-10-09 13:00 /tmp
drwxr-xr-x - hdfs hadoop 0 2012-10-09 12:51 /user

Now I try running again and it fails again, this time complaining that it wants write access to /mapred:

hyperion-hdp4@robing: hadoop jar /usr/share/hadoop/hadoop-examples*.jar teragen -D dfs.block.size=536870912 100
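A common workaround for this class of error (a hedged sketch of general Hadoop 1.x practice, not a confirmed resolution of Robin's specific case) is to leave mapred.system.dir owned by mapred with 700 and point the per-user job staging area at each user's HDFS home directory, which the submitting user owns:

<!-- mapred-site.xml: staging files then go under /user/<username>/.staging -->
<property>
  <name>mapreduce.jobtracker.staging.root.dir</name>
  <value>/user</value>
</property>

Each submitting user then needs a home directory in HDFS that they own, for example:

hadoop fs -mkdir /user/robing
hadoop fs -chown robing:hadoop /user/robing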
Re: Which hardware to choose
Which is a reasonable number for this hardware?

On 10/02/2012 09:40 PM, Michael Segel wrote:
I think he's saying that it's 24 maps and 8 reducers per node, and at 48 GB that could be too many mappers, especially if they want to run HBase.

On Oct 2, 2012, at 8:14 PM, hadoopman hadoop...@gmail.com wrote:
Only 24 map and 8 reduce tasks for 38 data nodes? Are you sure that's right? That sounds VERY low for a cluster that size. We have only 10 C2100s and are running, I believe, 140 map and 70 reduce slots so far, with pretty decent performance.

On 10/02/2012 12:55 PM, Alexander Pivovarov wrote:
38 data nodes + 2 name nodes.
Data node: Dell PowerEdge C2100 series, 2 x Xeon X5670, 48 GB ECC RAM (12 x 4 GB, 1333 MHz), 12 x 2 TB 7200 RPM SATA HDD (with hot swap) in JBOD, Intel Gigabit ET dual-port PCIe x4, redundant power supply.
Hadoop CDH3, max map tasks 24, max reduce tasks 8.

-- Marcos Luis Ortiz Valmaseda, Data Engineer / Sr. System Administrator at UCI - about.me/marcosortiz - @marcosluis2186
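As a rough, hedged sanity check (the 1 GB per task slot is an assumption; the thread does not state the configured task heap):

24 map slots x 1 GB + 8 reduce slots x 1 GB = 32 GB for tasks
+ roughly 1-2 GB each for the DataNode and TaskTracker daemons
+ several GB for an HBase RegionServer, if one shares the node
= roughly 36-42 GB of the 48 GB per node, before the OS page cache gets anything

So 24 + 8 slots is plausible on 48 GB without HBase, but it leaves little headroom if a RegionServer runs on the same box.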
Re: Hadoop Archives under 0.23
On 02/10/2012 2:12, Alexander Hristov wrote:
Hello, I'm trying to test the Hadoop archive functionality under 0.23 and I can't get it working. I have a /test folder in HDFS with several text files. I created a Hadoop archive using:
hadoop archive -archiveName test.har -p /test *.txt /sample
OK, this creates /sample/test.har with the appropriate parts (_index, _SUCCESS, _masterindex, part-0). Performing a cat on _index shows the text files. However, when I try to even list the contents of the HAR file using
hdfs dfs -ls -R har:///sample/test.har

The right command to do this is:
hdfs dfs -lsr har:///sample/test.har

I simply get "har:///sample/test.har : No such file or directory"! WTF? Accessing the individual files does work, however:
hdfs dfs -cat har:///sample/test.har/file.txt
works. Regards, Alexander

-- Marcos Ortiz Valmaseda, Data Engineer / Senior System Administrator at UCI - Blog: http://marcosluis2186.posterous.com - LinkedIn: http://www.linkedin.com/in/marcosluis2186 - Twitter: @marcosluis2186
Re: How to run multiple jobs at the same time?
Apache Mahout was built for that. Look here: https://cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering If you don't want to use Mahout's approach (highly recommended), you can use the MultipleInputs class for that: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/lib/input/MultipleInputs.html

An example from Tom White's book using MultipleInputs:
MultipleInputs.addInputPath(job, ncdcInputPath, TextInputFormat.class, MaxTemperatureMapper.class);
MultipleInputs.addInputPath(job, metOfficeInputPath, TextInputFormat.class, MetOfficeMaxTemperatureMapper.class);

On 09/23/2012 12:31 PM, Jason Yang wrote:
Hi, all. I have implemented a K-Means algorithm in MapReduce. This program consists of many iterations, and each iteration is a MapReduce job. Here is my pseudo-code:

int count = 0;
do {
    SET input path = output path of last iteration;
    SET output path = new path(count);
    ...
    runJob
} while (!converged && count < maxCount);

Now I have a question: what should I do if I would like to apply this algorithm to multiple datasets at the same time? Because there are dependencies between iterations, I have to use JobConf.runJob(), which blocks until the iteration finishes. Could I use threads? BTW, I'm using hadoop-0.20.2. -- YANG, Lin

-- Marcos Luis Ortiz Valmaseda, Data Engineer / Sr. System Administrator at UCI - about.me/marcosortiz - @marcosluis2186
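To show how the quoted MultipleInputs calls fit into a complete driver, here is a minimal sketch against the new (mapreduce) API linked above, which is available on newer releases than 0.20.2. The two mapper classes are the ones from the quoted book example, and the reducer name and key/value types are assumptions for illustration only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "max temperature");
    job.setJarByClass(MaxTemperatureDriver.class);
    // Two datasets, each parsed by its own mapper, feeding a single reduce phase
    MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, MaxTemperatureMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, MetOfficeMaxTemperatureMapper.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    job.setReducerClass(MaxTemperatureReducer.class);   // hypothetical reducer
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}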
Re: Suggestions required for learning Hadoop
Regards, Munnavar. There is a great pair of Refcardz from DZone, written by Eugene Ciurana (http://eugeneciurana.com), which are perfect for sysadmins interested in Hadoop:
- Getting Started with Hadoop
- Deploying Hadoop
http://refcardz.dzone.com

If you want to know more, there are a lot of courses available from Cloudera [1], Hortonworks [2], and MapR [3], and if you want to go deeper, there are certification programs from Cloudera and Hortonworks.

[1] http://www.cloudera.com/product-services
[2] http://hortonworks.com/
[3] http://academy.mapr.com

Best wishes.

On 09/13/2012 01:37 PM, Munnavar Shaik wrote:
Dear team members, I am working as a Linux administrator and I am interested in working on Hadoop. Please let me know where and how I can start learning. Any help with learning Hadoop and its related projects would be greatly appreciated. Thank you, Munnavar

-- Marcos Luis Ortiz Valmaseda, Data Engineer / Sr. System Administrator at UCI - about.me/marcosortiz - @marcosluis2186
Re: Hadoop or HBase
Regards to all the list. Well, you should ask the Tumblr folks about this: they use a combination of MySQL and HBase for their blogging platform. They talked about this topic at the last HBaseCon. Here is the link: http://www.hbasecon.com/sessions/growing-your-inbox-hbase-at-tumblr/ Blake Matheny, Director of Platform Engineering at Tumblr, was the presenter. Best wishes.

On 28/08/2012 6:18, Kai Voigt wrote:
Having a distributed filesystem doesn't save you from having backups. If someone deletes a file in HDFS, it's gone. What backend storage is supported by your CMS? Kai

On 28.08.2012 at 08:36, Kushal Agrawal kushalagra...@teledna.com wrote:
As the data is very large (tens of terabytes), it's difficult to take backups; it takes 1.5 days to back up the data every time. If we use a distributed file system instead, we don't need to do that. Thanks and regards, Kushal Agrawal

-----Original Message-----
From: Kai Voigt [mailto:k...@123.org]
Sent: Tuesday, August 28, 2012 11:57 AM
To: common-u...@hadoop.apache.org
Subject: Re: Hadoop or HBase

Typically, CMSs require an RDBMS, which Hadoop and HBase are not. Which CMS do you plan to use, and what's wrong with MySQL or other open source RDBMSs? Kai

On 28.08.2012 at 08:21, Kushal Agrawal kushalagra...@teledna.com wrote:
Hi, I want to use a DFS for a content-management system (CMS); I just want to store and retrieve files. Please suggest what I should use: Hadoop or HBase. Thanks and regards, Kushal Agrawal

-- Kai Voigt k...@123.org
Re: distcp error.
Hi, Tao. Is this problem only with 2.0.1, or with both versions? Have you tried to use distcp from 1.0.3 to 1.0.3?

On 28/08/2012 11:36, Tao wrote:
Hi, all. I use distcp to copy data from hadoop 1.0.3 to hadoop 2.0.1. When the file path (or file name) contains Chinese characters, an exception is thrown, like below. I need some help with this. Thanks.

[hdfs@host ~]$ hadoop distcp -i -prbugp -m 14 -overwrite -log /tmp/distcp.log hftp://10.xx.xx.aa:50070/tmp/中文路径测试 hdfs://10.xx.xx.bb:54310/tmp/distcp_test14
12/08/28 23:32:31 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=true, maxMaps=14, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[hftp://10.xx.xx.aa:50070/tmp/中文路径测试], targetPath=hdfs://10.xx.xx.bb:54310/tmp/distcp_test14}
12/08/28 23:32:33 INFO tools.DistCp: DistCp job log path: /tmp/distcp.log
12/08/28 23:32:34 WARN conf.Configuration: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
12/08/28 23:32:34 WARN conf.Configuration: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
12/08/28 23:32:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/08/28 23:32:36 INFO mapreduce.JobSubmitter: number of splits:1
12/08/28 23:32:36 WARN conf.Configuration: mapred.jar is deprecated. Instead, use mapreduce.job.jar
12/08/28 23:32:36 WARN conf.Configuration: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
12/08/28 23:32:36 WARN conf.Configuration: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
12/08/28 23:32:36 WARN conf.Configuration: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
12/08/28 23:32:36 WARN conf.Configuration: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
12/08/28 23:32:36 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name
12/08/28 23:32:36 WARN conf.Configuration: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
12/08/28 23:32:36 WARN conf.Configuration: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
12/08/28 23:32:36 WARN conf.Configuration: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
12/08/28 23:32:36 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
12/08/28 23:32:36 WARN conf.Configuration: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
12/08/28 23:32:36 WARN conf.Configuration: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
12/08/28 23:32:37 INFO mapred.ResourceMgrDelegate: Submitted application application_1345831938927_0039 to ResourceManager at baby20/10.1.1.40:8040
12/08/28 23:32:37 INFO mapreduce.Job: The url to track the job: http://baby20:8088/proxy/application_1345831938927_0039/
12/08/28 23:32:37 INFO tools.DistCp: DistCp job-id: job_1345831938927_0039
12/08/28 23:32:37 INFO mapreduce.Job: Running job: job_1345831938927_0039
12/08/28 23:32:50 INFO mapreduce.Job: Job job_1345831938927_0039 running in uber mode : false
12/08/28 23:32:50 INFO mapreduce.Job: map 0% reduce 0%
12/08/28 23:33:00 INFO mapreduce.Job: map 100% reduce 0%
12/08/28 23:33:00 INFO mapreduce.Job: Task Id : attempt_1345831938927_0039_m_00_0, Status : FAILED
Error: java.io.IOException: File copy failed: hftp://10.1.1.26:50070/tmp/中文路径测试/part-r-00017 -> hdfs://10.1.1.40:54310/tmp/distcp_test14/part-r-00017
at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:262)
at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:229)
at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:45)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:725)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:152)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:147)
Caused by: java.io.IOException: Couldn't run retriable-command: Copying hftp://10.1.1.26:50070/tmp/中文路径测试/part-r-00017 to hdfs://10.1.1.40:54310/tmp/distcp_test14/part-r-00017
at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:101)
at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:258)
... 10 more
Caused by:
Re: Hadoop 1.0.3 setup
On 07/09/2012 09:58 AM, prabhu K wrote: Yes, i have configuared multinode setup, 1 master 2 slaves, i have formated the namenode and then i run the stat-dfs.sh script and start-mapred.sh script. I run the bin/hadoop fs -put input input command , getting following error on my terminal. hduser@md-trngpoc1:/usr/local/hadoop_dir/hadoop$ bin/hadoop fs -put input input Warning: $HADOOP_HOME is deprecated. put: org.apache.hadoop.security.AccessControlException: Permission denied: user=hduser, access=WRITE, inode=:root:supergroup:rwxr-xr-x and executed the below command, getting /hadoop-install/hadoop directroy, i coud't understand what's wrong iam doing? Well, this erros says to you that you have the wrong permissions in the hadoop directory, the user and group that you have is root:supergroup and the correct values for it is: hduser:supergroup hduser@md-trngpoc1:/usr/local/hadoop_dir/hadoop$ echo $HADOOP_HOME /hadoop-install/hadoop *Namenode log:* == java.lang.InterruptedException: sleep interrupted at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.hdfs.server.namenode.DecommissionManager$Monitor.run(DecommissionManager.java:65) at java.lang.Thread.run(Thread.java:662) 2012-07-09 19:02:12,696 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.net.BindException: Problem binding to md-trngpoc1/10.5.114.110:54310 : Address alrea dy in use It seems that you are using that address:port values. Use this commands: netstat -puta | grep namenode netstat -puta | grep datanode to check which are the ports that the NN and DN are using. at org.apache.hadoop.ipc.Server.bind(Server.java:227) at org.apache.hadoop.ipc.Server$Listener.init(Server.java:301) at org.apache.hadoop.ipc.Server.init(Server.java:1483) at org.apache.hadoop.ipc.RPC$Server.init(RPC.java:545) at org.apache.hadoop.ipc.RPC.getServer(RPC.java:506) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:294) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:496) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288) Caused by: java.net.BindException: Address already in use at sun.nio.ch.Net.bind(Native Method) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:126) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59) at org.apache.hadoop.ipc.Server.bind(Server.java:225) ... 8 more *Datanode log* = 2012-07-09 18:44:39,949 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG: / STARTUP_MSG: Starting DataNode STARTUP_MSG: host = md-trngpoc3/10.5.114.168 STARTUP_MSG: args = [] STARTUP_MSG: version = 1.0.3 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1335192; compiled by 'hortonfo' on Tue May 8 20:31:25 UTC 2012 / 2012-07-09 18:44:40,039 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties 2012-07-09 18:44:40,047 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source MetricsSystem,sub=Stats registered. 2012-07-09 18:44:40,048 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s). 2012-07-09 18:44:40,048 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics system started 2012-07-09 18:44:40,125 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi registered. 
2012-07-09 18:44:40,163 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Invalid directory in dfs.data.dir: can not create directory: /app/hadoop_dir/hadoop/tmp/df s/data 2012-07-09 18:44:40,163 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: All directories in dfs.data.dir are invalid. 2012-07-09 18:44:40,163 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode 2012-07-09 18:44:40,164 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down DataNode at md-trngpoc3/10.5.114.168 / 2012-07-09 18:46:09,586 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG: / STARTUP_MSG: Starting DataNode STARTUP_MSG: host = md-trngpoc3/10.5.114.168 STARTUP_MSG: args = [] STARTUP_MSG: version = 1.0.3 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1335192; compiled by 'hortonfo' on Tue May 8 20:31:25 UTC 2012
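For reference, the usual fix for that AccessControlException is to have the HDFS superuser run hadoop fs -chown hduser:supergroup on the user's home directory. If you prefer to check and fix it programmatically, a minimal sketch using the FileSystem API is below; the /user/hduser path is only an assumed example, and the setOwner() call must be run as the HDFS superuser:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FixHomeDirOwner {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        Path home = new Path("/user/hduser");       // assumed path, adjust to your layout
        FileStatus status = fs.getFileStatus(home);
        System.out.println("Current owner: " + status.getOwner() + ":" + status.getGroup());
        // Only the HDFS superuser may change ownership
        fs.setOwner(home, "hduser", "supergroup");
      }
    }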
Re: Versions
On 07/07/2012 02:39 PM, Harsh J wrote: The Apache Bigtop project was started for this very purpose (building stable, well inter-operating version stacks). Take a read at http://incubator.apache.org/bigtop/ and for 1.x Bigtop packages, see https://cwiki.apache.org/confluence/display/BIGTOP/How+to+install+Hadoop+distribution+from+Bigtop To specifically answer your question though, your list appears fine to me. They 'should work', but I am not suggesting that I have tested this stack completely myself. On Sat, Jul 7, 2012 at 11:57 PM, prabhu K prabhu.had...@gmail.com wrote: Hi users list, I am planing to install following tools. Hadoop 1.0.3 hive 0.9.0 flume 1.2.0 Hbase 0.92.1 sqoop 1.4.1 My only suggestion here is that you use the 0.94 version of HBase, it has a lot of improvements over 0.92.1 See the Cloudera's blog post for it: http://www.cloudera.com/blog/2012/05/apache-hbase-0-94-is-now-released/ Best wishes my questions are. 1. the above tools are compatible with all the versions. 2. any tool need to change the version 3. list out all the tools with compatible versions. Please suggest on this? -- Marcos Luis Ortíz Valmaseda *Data Engineer Sr. System Administrator at UCI* 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci
Re: set up Hadoop cluster on mixed OS
I have a mixed cluster too, with Linux (CentOS) and Solaris; the only recommendation I can give you is to use exactly the same Hadoop version on all machines. Best wishes On 07/06/2012 05:31 AM, Senthil Kumar wrote: You can set up a Hadoop cluster in a mixed environment. We have a cluster with Mac, Linux and Solaris. Regards Senthil On Fri, Jul 6, 2012 at 1:50 PM, Yongwei Xing jdxyw2...@gmail.com wrote: I have one MBP with 10.7.4 and one laptop with Ubuntu 12.04. Is it possible to set up a Hadoop cluster in such a mixed environment? Best Regards, -- Welcome to my ET Blog http://www.jdxyw.com 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci -- Marcos Luis Ortíz Valmaseda *Data Engineer Sr. System Administrator at UCI* about.me/marcosortiz http://about.me/marcosortiz My Blog http://marcosluis2186.posterous.com @marcosluis2186 http://twitter.com/marcosluis2186 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci
Re: How to connect to a cluster by using eclipse
Jason, Ramon is right. The best way to debug a MapReduce job is mounting a local cluster, and then, when you have tested enough your code, then, you can deploy it in a real distributed cluster. On 07/04/2012 10:00 PM, Jason Yang wrote: ramon, Thank for your reply very much. However, I was still wonder whether I could debug a MR application in this way. I have read some posts talking about using NAT to redirect all the packets to the network card which connect to the local LAN, but it does not work as I tried to redirect by using iptables :( 在 2012年7月4日星期三, 写道: Jason, the easiest way to debug a MapRedupe program with eclipse is working on hadoop local. http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html#Local In this mode all the components run locally on the same VM and can be easily debugged using Eclipse. Hope this will be useful. *From:*Jason Yang [mailto:lin.yang.ja...@gmail.com javascript:_e({}, 'cvml', 'lin.yang.ja...@gmail.com');] *Sent:* miércoles, 04 de julio de 2012 11:25 *To:* mapreduce-user *Subject:* How to connect to a cluster by using eclipse Hi, all I have a hadoop cluster with 3 nodes, the network topology is like this: 1. For each DataNode, its IP address is like :192.168.0.XXX; 2. For the NameNode, it has two network cards: one is connect with the DataNodes as a local LAN with IP address 192.168.0.110, while the other one is connect to the company network(which eventually connect to the Internet); -- now I'm trying to debug a MapReduce program on a computer which is in the company network. Since the jobtracker in this scenario is 192.168.0.110:9001 http://192.168.0.110:9001, I was wondering how could I connect to the cluster by using eclipse? -- YANG, Lin Subject to local law, communications with Accenture and its affiliates including telephone calls and emails (including content), may be monitored by our systems for the purposes of security and the assessment of internal compliance with Accenture policy. __ www.accenture.com http://www.accenture.com -- YANG, Lin -- Marcos Luis Ortíz Valmaseda *Data Engineer Sr. System Administrator at UCI* about.me/marcosortiz http://about.me/marcosortiz My Blog http://marcosluis2186.posterous.com @marcosluis2186 http://twitter.com/marcosluis2186 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci
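To make the local-mode suggestion concrete, here is a minimal driver sketch for Hadoop 1.x that forces a job onto the local runner and the local file system, so breakpoints in the mapper and reducer are hit inside the Eclipse JVM. The class name and input/output paths are placeholders, and no mapper or reducer is set, so the identity classes are used:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LocalDebugDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Run the whole job inside this JVM instead of submitting it to the cluster jobtracker
        conf.set("mapred.job.tracker", "local");
        conf.set("fs.default.name", "file:///");

        Job job = new Job(conf, "local-debug");
        job.setJarByClass(LocalDebugDriver.class);
        // job.setMapperClass(...); job.setReducerClass(...);  // your real classes go here
        FileInputFormat.addInputPath(job, new Path("testdata/input"));     // local directory
        FileOutputFormat.setOutputPath(job, new Path("testdata/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Once the job behaves correctly locally, the same driver can be pointed back at the real cluster by dropping the two conf.set() calls.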
Re: Yarn job runs in Local Mode even though the cluster is running in Distributed Mode
According to the CDH 4 official documentation, you should install a JobHistory server for your MRv2 (YARN) cluster. https://ccp.cloudera.com/display/CDH4DOC/Deploying+MapReduce+v2+%28YARN%29+on+a+Cluster How to configure the HistoryServer https://ccp.cloudera.com/display/CDH4DOC/Deploying+MapReduce+v2+%28YARN%29+on+a+Cluster#DeployingMapReducev2%28YARN%29onaCluster-Step3 On 06/13/2012 03:16 PM, anil gupta wrote: Hi All I am using cdh4 for running a HBase cluster on CentOs6.0. I have 5 nodes in my cluster(2 Admin Node and 3 DN). My resourcemanager is up and running and showing that all three DN are running the nodemanager. HDFS is also working fine and showing 3 DN's. But when i fire the pi example job. It starts to run in Local mode. Here is the console output: sudo -u hdfs yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce- examples.jar pi 10 10 Number of Maps = 10 Samples per Map = 10 Wrote input for Map #0 Wrote input for Map #1 Wrote input for Map #2 Wrote input for Map #3 Wrote input for Map #4 Wrote input for Map #5 Wrote input for Map #6 Wrote input for Map #7 Wrote input for Map #8 Wrote input for Map #9 Starting Job 12/06/13 12:03:27 WARN conf.Configuration: session.id is deprecated. Instead, use dfs.metrics.session-id 12/06/13 12:03:27 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 12/06/13 12:03:27 INFO util.NativeCodeLoader: Loaded the native-hadoop library 12/06/13 12:03:27 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 12/06/13 12:03:28 INFO mapred.FileInputFormat: Total input paths to process : 10 12/06/13 12:03:29 INFO mapred.JobClient: Running job: job_local_0001 12/06/13 12:03:29 INFO mapred.LocalJobRunner: OutputCommitter set in config null 12/06/13 12:03:29 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter 12/06/13 12:03:29 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead 12/06/13 12:03:29 INFO util.ProcessTree: setsid exited with exit code 0 12/06/13 12:03:29 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3d46e381 12/06/13 12:03:29 WARN mapreduce.Counters: Counter name MAP_INPUT_BYTES is deprecated. Use FileInputFormatCounters as group name and BYTES_READ as counter name instead 12/06/13 12:03:29 INFO mapred.MapTask: numReduceTasks: 1 12/06/13 12:03:29 INFO mapred.MapTask: io.sort.mb = 100 12/06/13 12:03:30 INFO mapred.MapTask: data buffer = 79691776/99614720 12/06/13 12:03:30 INFO mapred.MapTask: record buffer = 262144/327680 12/06/13 12:03:30 INFO mapred.JobClient: map 0% reduce 0% 12/06/13 12:03:35 INFO mapred.LocalJobRunner: Generated 95735000 samples. 12/06/13 12:03:36 INFO mapred.JobClient: map 100% reduce 0% 12/06/13 12:03:38 INFO mapred.LocalJobRunner: Generated 151872000 samples. 
Here is the content of yarn-site.xml:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <description>List of directories to store localized files in.</description>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/disk/yarn/local</value>
  </property>
  <property>
    <description>Where to store container logs.</description>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/disk/yarn/logs</value>
  </property>
  <property>
    <description>Where to aggregate logs to.</description>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/var/log/hadoop-yarn/apps</value>
  </property>
  <property>
    <description>Classpath for typical applications.</description>
    <name>yarn.application.classpath</name>
    <value>
      $HADOOP_CONF_DIR,
      $HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
      $HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
      $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
      $YARN_HOME/*,$YARN_HOME/lib/*
    </value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>ihub-an-g1:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>ihub-an-g1:8040</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>ihub-an-g1:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>ihub-an-g1:8141</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>ihub-an-g1:8088</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.intermediate-done-dir</name>
    <value>/disk/mapred/jobhistory/intermediate/done</value>
  </property>
  <property>
Re: Yarn job runs in Local Mode even though the cluster is running in Distributed Mode
Can you share with us in pastebin all conf files that you are using for YARN? On 06/13/2012 05:26 PM, anil gupta wrote: Hi Marcus, Sorry i forgot to mention that Job history server is installed and running and AFAIK resourcemanager is responsible for running MR jobs. Historyserver is only used to get info about MR jobs. Thanks, Anil On Wed, Jun 13, 2012 at 2:04 PM, Marcos Ortiz mlor...@uci.cu mailto:mlor...@uci.cu wrote: According to the CDH 4 official documentation, you should install a JobHistory server for your MRv2 (YARN) cluster. https://ccp.cloudera.com/display/CDH4DOC/Deploying+MapReduce+v2+%28YARN%29+on+a+Cluster How to configure the HistoryServer https://ccp.cloudera.com/display/CDH4DOC/Deploying+MapReduce+v2+%28YARN%29+on+a+Cluster#DeployingMapReducev2%28YARN%29onaCluster-Step3 On 06/13/2012 03:16 PM, anil gupta wrote: Hi All I am using cdh4 for running a HBase cluster on CentOs6.0. I have 5 nodes in my cluster(2 Admin Node and 3 DN). My resourcemanager is up and running and showing that all three DN are running the nodemanager. HDFS is also working fine and showing 3 DN's. But when i fire the pi example job. It starts to run in Local mode. Here is the console output: sudo -u hdfs yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce- examples.jar pi 10 10 Number of Maps = 10 Samples per Map = 10 Wrote input for Map #0 Wrote input for Map #1 Wrote input for Map #2 Wrote input for Map #3 Wrote input for Map #4 Wrote input for Map #5 Wrote input for Map #6 Wrote input for Map #7 Wrote input for Map #8 Wrote input for Map #9 Starting Job 12/06/13 12:03:27 WARN conf.Configuration: session.id http://session.id is deprecated. Instead, use dfs.metrics.session-id 12/06/13 12:03:27 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 12/06/13 12:03:27 INFO util.NativeCodeLoader: Loaded the native-hadoop library 12/06/13 12:03:27 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 12/06/13 12:03:28 INFO mapred.FileInputFormat: Total input paths to process : 10 12/06/13 12:03:29 INFO mapred.JobClient: Running job: job_local_0001 12/06/13 12:03:29 INFO mapred.LocalJobRunner: OutputCommitter set in config null 12/06/13 12:03:29 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter 12/06/13 12:03:29 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead 12/06/13 12:03:29 INFO util.ProcessTree: setsid exited with exit code 0 12/06/13 12:03:29 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3d46e381 12/06/13 12:03:29 WARN mapreduce.Counters: Counter name MAP_INPUT_BYTES is deprecated. Use FileInputFormatCounters as group name and BYTES_READ as counter name instead 12/06/13 12:03:29 INFO mapred.MapTask: numReduceTasks: 1 12/06/13 12:03:29 INFO mapred.MapTask: io.sort.mb = 100 12/06/13 12:03:30 INFO mapred.MapTask: data buffer = 79691776/99614720 12/06/13 12:03:30 INFO mapred.MapTask: record buffer = 262144/327680 12/06/13 12:03:30 INFO mapred.JobClient: map 0% reduce 0% 12/06/13 12:03:35 INFO mapred.LocalJobRunner: Generated 95735000 samples. 12/06/13 12:03:36 INFO mapred.JobClient: map 100% reduce 0% 12/06/13 12:03:38 INFO mapred.LocalJobRunner: Generated 151872000 samples. 
Here is the content of yarn-site.xml:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <description>List of directories to store localized files in.</description>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/disk/yarn/local</value>
  </property>
  <property>
    <description>Where to store container logs.</description>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/disk/yarn/logs</value>
  </property>
  <property>
    <description>Where to aggregate logs to.</description>
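For readers hitting the same symptom (job_local_0001 and LocalJobRunner on a YARN cluster): one common cause on CDH4 is that mapreduce.framework.name is not set to yarn on the client, so the job client silently falls back to the local runner. That property normally belongs in mapred-site.xml on the submitting machine; this is a guess at the root cause, not a confirmed diagnosis of this particular cluster. A small sketch to confirm what value the client actually sees:

    import org.apache.hadoop.conf.Configuration;

    public class FrameworkNameCheck {
      public static void main(String[] args) {
        Configuration conf = new Configuration();   // loads *-site.xml from the client classpath
        // The default is "local", which means LocalJobRunner; it must be "yarn" for MRv2 submission
        System.out.println("mapreduce.framework.name = " + conf.get("mapreduce.framework.name", "local"));
      }
    }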
Re: override mapred-site.xml from command line
On 06/06/2012 07:44 PM, Sid Kumar wrote: I am able to set it via the API. Configuration.setBoolean(mapred.output.compress,true). This works! But the -D from the command line still doesn't work. Any idea what I may be missing here? Some additional info - Also when I try running the -D on command line on a local cluster (pseudo distributed mode) it works, but when I try it on a fully distributed cluster running jobs from a client machine it doesn't work. Is there a different way for setting it in this case - in hadoop-env perhaps? Thanks Sid On Wed, Jun 6, 2012 at 4:06 PM, Sid Kumar sqlsid...@gmail.com mailto:sqlsid...@gmail.com wrote: Mayank, I dont have a final tag for that property set. I looked at the mapred-default.xml in the src/mapred folder and that doesn't have a final tag too. Should I set it explicitly to false? You should do it explicitly. You should read the excellent blog post from Lars Francke where he did a great job explaining parameter by parameter and why is recommendable to set them to final. http://gbif.blogspot.com/2011/01/setting-up-hadoop-cluster-part-1-manual.html Regards Sid On Wed, Jun 6, 2012 at 3:50 PM, Mayank Bansal may...@apache.org mailto:may...@apache.org wrote: Check your mapred site xml if these parameters have finaltrue/final making final to false should solve your problem. On Wed, Jun 6, 2012 at 3:41 PM, Sid Kumar sqlsid...@gmail.com mailto:sqlsid...@gmail.com wrote: Hi, I am trying to override mapred-site.xml (more specifically mapred.compress.map.output and mapred.output.compression. codec) from the command line when I execute the jar. I have been using hadoop jar jarname class - Dmapred.compress.map.output=true and -Dmapred.output.compression.codec=org.apache.hadoop.io.SnappyCodec The above doesnt work as the job.xml for the jar still uses the default properties and not the one i specify here. Is there a different approach to override these properties. I am submitting jobs from a client machine that has the same version of configuration files as my cluster. Thanks Sid -- Marcos Luis Ortíz Valmaseda Data Engineer Sr. System Administrator at UCI http://marcosluis2186.posterous.com http://www.linkedin.com/in/marcosluis2186 Twitter: @marcosluis2186 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci
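To expand on the GenericOptionsParser hint in this thread: -D options on the command line are only honored when the driver runs through ToolRunner, which parses the generic options and hands them to the job via getConf(). A minimal old-API sketch follows; the class name and paths are placeholders, not the poster's actual job:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyJobDriver extends Configured implements Tool {
      public int run(String[] args) throws Exception {
        // getConf() already contains anything passed as -Dkey=value before the job arguments
        JobConf job = new JobConf(getConf(), MyJobDriver.class);
        job.setJobName("my-job");
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // mapper/reducer are left as the identity defaults in this sketch
        JobClient.runJob(job);
        return 0;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyJobDriver(), args));
      }
    }

With that in place, an invocation such as hadoop jar myjob.jar MyJobDriver -Dmapred.compress.map.output=true in out should take effect when submitting to the remote cluster as well. Note also that the Snappy codec class is normally org.apache.hadoop.io.compress.SnappyCodec, not org.apache.hadoop.io.SnappyCodec.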
Re: HBase is able to connect to ZooKeeper but the connection closes immediately
Can you show us the code that you are developing? Which HBase version are you using ? Yo should check if you are creating multiples HBaseConfiguration objects. The approach to this is to create one single HBaseConfiguration object and then reuse it in all your code. Regards On 06/06/2012 10:25 AM, Manu S wrote: Hi All, We are running a mapreduce job in a fully distributed cluster.The output of the job is writing to HBase. While running this job we are getting an error: *Caused by: org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase is able to connect to ZooKeeper but the connection closes immediately. This could be a sign that the server has too many connections (30 is the default). Consider inspecting your ZK server logs for that error and then make sure you are reusing HBaseConfiguration as often as you can. See HTable's javadoc for more information.* at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.init(ZooKeeperWatcher.java:155) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:1002) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.setupZookeeperTrackers(HConnectionManager.java:304) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.init(HConnectionManager.java:295) at org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:157) at org.apache.hadoop.hbase.client.HTable.init(HTable.java:169) at org.apache.hadoop.hbase.client.HTableFactory.createHTableInterface(HTableFactory.java:36) I had gone through some threads related to this issue and I modified the *zoo.cfg* accordingly. These configurations are same in all the nodes. Please find the configuration of HBase ZooKeeper: Hbase-site.xml: configuration property namehbase.cluster.distributed/name valuetrue/value /property property namehbase.rootdir/name valuehdfs://namenode/hbase/value /property property namehbase.zookeeper.quorum/name valuenamenode/value /property /configuration Zoo.cfg: # The number of milliseconds of each tick tickTime=2000 # The number of ticks that the initial # synchronization phase can take initLimit=10 # The number of ticks that can pass between # sending a request and getting an acknowledgement syncLimit=5 # the directory where the snapshot is stored. dataDir=/var/zookeeper # the port at which the clients will connect clientPort=2181 #server.0=localhost:2888:3888 server.0=namenode:2888:3888 # Max Client connections ### *maxClientCnxns=1000 minSessionTimeout=4000 maxSessionTimeout=4* It would be really great if anyone can help me to resolve this issue by giving your thoughts/suggestions. Thanks, Manu S -- Marcos Luis Ortíz Valmaseda Data Engineer Sr. System Administrator at UCI http://marcosluis2186.posterous.com http://www.linkedin.com/in/marcosluis2186 Twitter: @marcosluis2186 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci
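As an illustration of the "reuse one HBaseConfiguration" advice, the sketch below builds the Configuration and HTable once per map task in setup() and reuses them for every record, instead of opening a new ZooKeeper connection per call. The table, column family and qualifier names are invented for the example:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class HBaseWritingMapper extends Mapper<LongWritable, Text, Text, Text> {
      private HTable table;

      @Override
      protected void setup(Context context) throws IOException {
        // One Configuration and one connection per task, not per record
        Configuration conf = HBaseConfiguration.create(context.getConfiguration());
        table = new HTable(conf, "mytable");
      }

      @Override
      protected void map(LongWritable key, Text value, Context context) throws IOException {
        Put put = new Put(Bytes.toBytes(key.get()));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("line"), Bytes.toBytes(value.toString()));
        table.put(put);
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        table.close();
      }
    }

For production jobs, TableMapReduceUtil/TableOutputFormat is the more usual way to wire a job's output into HBase, but the same rule applies: one configuration and one connection per task.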
Re: No space left on device
Do you have the JT and NN on the same node? Look here at Lars Francke's post: http://gbif.blogspot.com/2011/01/setting-up-hadoop-cluster-part-1-manual.html This is a very good walkthrough of how to install Hadoop; look at the configuration he used for the name and data directories. If these directories are on the same disk and you don't have enough space, you can hit that exception. My recommendation is to put these directories on separate disks, with a layout very similar to Lars's configuration. Another recommendation is to check Hadoop's logs. Read about this here: http://www.cloudera.com/blog/2010/11/hadoop-log-location-and-retention/ Regards On 05/28/2012 02:20 AM, yingnan.ma wrote: OK, I found it. The jobtracker server is full. 2012-05-28 yingnan.ma From: yingnan.ma Sent: 2012-05-28 13:01:56 To: common-user Cc: Subject: No space left on device Hi, I encountered the following problem: Error - Job initialization failed: org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:201) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123) at java.io.FilterOutputStream.close(FilterOutputStream.java:140) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.close(ChecksumFileSystem.java:348) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86) at org.apache.hadoop.mapred.JobHistory$JobInfo.logSubmitted(JobHistory.java:1344) .. So, I think the HDFS is full or something, but I cannot find a way to address the problem; if you have any suggestion, please show me, thank you. Best Regards -- Marcos Luis Ortíz Valmaseda Data Engineer Sr. System Administrator at UCI http://marcosluis2186.posterous.com http://www.linkedin.com/in/marcosluis2186 Twitter: @marcosluis2186 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci
Re: EOFException at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)......
Regards, waqas. I think that you have to ask to MapR experts. On 05/25/2012 05:42 AM, waqas latif wrote: Hi Experts, I am fairly new to hadoop MapR and I was trying to run a matrix multiplication example presented by Mr. Norstadt under following link http://www.norstad.org/matrix-multiply/index.html. I can run it successfully with hadoop 0.20.2 but I tried to run it with hadoop 1.0.3 but I am getting following error. Is it the problem with my hadoop configuration or it is compatibility problem in the code which was written in hadoop 0.20 by author.Also please guide me that how can I fix this error in either case. Here is the error I am getting. The same code that you write for 0.20.2 should work in 1.0.3 too. in thread main java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:180) at java.io.DataInputStream.readFully(DataInputStream.java:152) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1486) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1475) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1470) at TestMatrixMultiply.fillMatrix(TestMatrixMultiply.java:60) at TestMatrixMultiply.readMatrix(TestMatrixMultiply.java:87) at TestMatrixMultiply.checkAnswer(TestMatrixMultiply.java:112) at TestMatrixMultiply.runOneTest(TestMatrixMultiply.java:150) at TestMatrixMultiply.testRandom(TestMatrixMultiply.java:278) at TestMatrixMultiply.main(TestMatrixMultiply.java:308) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Thanks in advance Regards, waqas Can you put here the completed log for this? Best wishes -- Marcos Luis Ortíz Valmaseda Data Engineer Sr. System Administrator at UCI http://marcosluis2186.posterous.com http://www.linkedin.com/in/marcosluis2186 Twitter: @marcosluis2186 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci
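For what it's worth, an EOFException inside SequenceFile$Reader.init() usually means the file being opened is empty or is not actually a SequenceFile (for example, a directory was expected but a different path was passed). A small Hadoop 1.0.x sketch for dumping a SequenceFile, useful for checking what the previous job really wrote; the path argument is whatever part file you want to inspect:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class SequenceFileDump {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(args[0]);   // e.g. a part-00000 file, not a directory
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
          Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
          Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
          while (reader.next(key, value)) {
            System.out.println(key + "\t" + value);
          }
        } finally {
          reader.close();
        }
      }
    }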
Re: While Running in cloudera version of hadoop getting error
Why don´t use the same Hadoop version in both clusters? It will brings to you minor troubles. On 05/24/2012 02:26 PM, samir das mohapatra wrote: Hi I created application jar and i was trying to run in 2 node cluster using cludera .20 version , it was running fine, But when i am running that same jar in Deployment server (Cloudera version .20.x ) having 40 node cluster I am getting error cloude any one please help me with this. 12/05/24 09:39:09 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. Like this says here, you should implement Tool for your MapReduce Job 12/05/24 09:39:10 INFO mapred.FileInputFormat: Total input paths to process : 1 12/05/24 09:39:10 INFO mapred.JobClient: Running job: job_201203231049_12426 12/05/24 09:39:11 INFO mapred.JobClient: map 0% reduce 0% 12/05/24 09:39:20 INFO mapred.JobClient: Task Id : attempt_201203231049_12426_m_00_0, Status : FAILED java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325) at org.apache.hadoop.mapred.Child$4.run(Child.java:270) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157) at org.apache.hadoop.mapred.Child.main(Child.java:264) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav attempt_201203231049_12426_m_00_0: getDefaultExtension() 12/05/24 09:39:20 INFO mapred.JobClient: Task Id : attempt_201203231049_12426_m_01_0, Status : FAILED Thanks samir -- Marcos Luis Ortíz Valmaseda Data Engineer Sr. System Administrator at UCI http://marcosluis2186.posterous.com http://www.linkedin.com/in/marcosluis2186 Twitter: @marcosluis2186 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci
Re: Is it okay to upgrade from CDH3U2 to hadoop 1.0.2 and hbase 0.92.1?
I think that you should follow the CDH4 Beta 2 docs, specifically the know issues for this version: https://ccp.cloudera.com/display/CDH4B2/Known+Issues+and+Work+Arounds+in+CDH4 Then, you should see the HBase installation and upgrading on this version: https://ccp.cloudera.com/display/CDH4B2/HBase+Installation#HBaseInstallation-InstallingHBase Another thing that you keep in mind is that with HBase 0.92.1, you should restart your cluster because the wire protocol changed from 0.90 to 0.92, so, the rolling restarts do not work here. Best wishes On 05/21/2012 10:44 PM, edward choi wrote: Hi, I have used CDH3U2 for almost a year now. Since it is a quite old distribution, there are certain glitches that keep bothering me. So I was considering upgrading to Hadoop 1.0.3 and Hbase 0.92.1. My concern is that, if it is okay to just install the new packages and set the configurations the same as before? Or do I need to download all the files on HDFS to local hard drive and upload them again once the new packages are installed? (that would be a horrible job to do though) Any advice will be helpful. Thanks. Ed 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci -- Marcos Luis Ortíz Valmaseda Data Engineer Sr. System Administrator at UCI http://marcosluis2186.posterous.com http://www.linkedin.com/in/marcosluis2186 Twitter: @marcosluis2186 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci
Re: namenode directory disappear after machines restart
This is an usual behavior on Unix/Linux systems. When you restart the system, the content of the /tmp directory is cleaned, because precisely, the purpose of this directory is to keep files temporally. For that reason, the data directory for the HDFS filesystem should be another directory, /var/hadoop/data for example, of course, a directory durable in time. So, you should change your dfs.name.dir and your dfs.data.dir variable in your hdfs-site.xml. Regards On 05/21/2012 11:21 PM, Brendan cheng wrote: Hi, I'm not sure if there is a setting to avoid the Namenode removed after hosting machine of Namenode restart.I found that after successfully installed single node pseudo distributed hadoop following from your website, the name node dir /tmp/hadoop-brendan/dfs/name are removed if machine reboot. What do I miss? Brendan 2012-05-22 11:14:05,678 INFO org.apache.hadoop.hdfs.server.common.Storage: Storage directory /tmp/hadoop-brendan/dfs/name does not exist.2012-05-22 11:14:05,680 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/hadoop-brendan/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:303) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:362) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:496) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)2012-05-22 11:14:05,685 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/hadoop-brendan/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible. at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:303) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:362) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:496) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288) 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci -- Marcos Luis Ortíz Valmaseda Data Engineer Sr. System Administrator at UCI http://marcosluis2186.posterous.com http://www.linkedin.com/in/marcosluis2186 Twitter: @marcosluis2186 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci
Re: hadoop on fedora 15
On 04/26/2012 01:49 AM, john cohen wrote: I had the same issue. My problem was using a VPN connection to work while at the same time working with M/R jobs on my Mac. It occurred to me that maybe Hadoop was binding to the wrong IP (the IP given to you after connecting through the VPN); bottom line, I disconnected from the VPN, and the M/R job finished as expected after that. That makes sense: once you connect to the VPN, your machine gets a different IP address assigned by the private network, so Hadoop can end up binding to it. You can test this by updating your configuration with the new VPN-assigned addresses. -- Marcos Luis Ortíz Valmaseda (@marcosluis2186) Data Engineer at UCI http://marcosluis2186.posterous.com 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci
Re: unable to resolve the heap space error even when running the examples
Can you show to us the logs of your NN/DN? On 04/12/2012 03:28 AM, SRIKANTH KOMMINENI (RIT Student) wrote: Tried that it didn't work for a lot of combinations of values On Thu, Apr 12, 2012 at 3:25 AM, Mapred Learn mapred.le...@gmail.com mailto:mapred.le...@gmail.com wrote: Try exporting HADOOP_HEAPSIZE to bigger value like 1500 (1.5 gb) before running program or change it in hadoop-env.sh If still gives error, u can try with bigger value. Sent from my iPhone On Apr 12, 2012, at 12:10 AM, SRIKANTH KOMMINENI (RIT Student) sxk7...@rit.edu mailto:sxk7...@rit.edu wrote: Hello, I have searched a lot and still cant find any solution that can fix my problem. I am using the the basic downloaded version of hadoop-1.0.2 and I have edited only what has been asked in the setup page of hadoop and I have set it up to work in a pseudo random distributed mode. My JAVA_HOME is set to /usr/lib/jvm/java-6-sun, I tried editing the heap size in hadoop-env.sh that didn't work. I tried setting the CHILD_OPTS that didn't work, I found that there was another hadoop-env.sh in /etc/hadoop/ as per the recommendations in the mailing list archives that didn't work . I tried increasing the io.sort.mb that didn't work. I am totally frustrated but it still doesn't work.please help. -- Srikanth Kommineni, Graduate Assistant, Dept of Computer Science, Rochester Institute of Technology. -- Srikanth Kommineni, Graduate Assistant, Dept of Computer Science, Rochester Institute of Technology. -- Srikanth Kommineni, Graduate Assistant, Dept of Computer Science, Rochester Institute of Technology. -- Srikanth Kommineni, Graduate Assistant, Dept of Computer Science, Rochester Institute of Technology. -- Marcos Luis Ortíz Valmaseda (@marcosluis2186) Data Engineer at UCI http://marcosluis2186.posterous.com 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci
Yahoo Hadoop Tutorial with new APIs?
Regards to all the list. There are many people that use the Hadoop Tutorial released by Yahoo at http://developer.yahoo.com/hadoop/tutorial/ http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining The main issue is that this tutorial is written against the old APIs (Hadoop 0.18, I think). Is there a project to update this tutorial to the new APIs, i.e. Hadoop 1.0.2 or YARN (Hadoop 0.23)? Best wishes -- Marcos Luis Ortíz Valmaseda (@marcosluis2186) Data Engineer at UCI http://marcosluis2186.posterous.com 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci
Re: Yahoo Hadoop Tutorial with new APIs?
On 04/04/2012 09:15 AM, Jagat Singh wrote: Hello Marcos Yes, the Yahoo tutorials are pretty old, but they still explain the concepts of Map Reduce and HDFS beautifully. The way the tutorials are divided into sub-sections, each building on the previous one, is awesome. I remember when I started I dug into them for many days. The tutorials are lagging now from the new API point of view. Yes, for that reason, and for its beauty, this tutorial is read by many Hadoop newcomers, so I think it needs an update. Let's have a documentation session one day; I would love to volunteer to update those tutorials if the people at Yahoo take input from the outside world :) I want to help with this too, so we need to talk with our Hadoop colleagues to make it happen. Regards and best wishes Regards, Jagat - Original Message - From: Marcos Ortiz Sent: 04/04/12 08:32 AM To: common-user@hadoop.apache.org, 'hdfs-u...@hadoop.apache.org', mapreduce-u...@hadoop.apache.org Subject: Yahoo Hadoop Tutorial with new APIs? Regards to all the list. There are many people that use the Hadoop Tutorial released by Yahoo at http://developer.yahoo.com/hadoop/tutorial/ http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining The main issue is that this tutorial is written against the old APIs (Hadoop 0.18, I think). Is there a project to update this tutorial to the new APIs, i.e. Hadoop 1.0.2 or YARN (Hadoop 0.23)? Best wishes -- Marcos Luis Ortíz Valmaseda (@marcosluis2186) Data Engineer at UCI http://marcosluis2186.posterous.com http://www.uci.cu/ -- Marcos Luis Ortíz Valmaseda (@marcosluis2186) Data Engineer at UCI http://marcosluis2186.posterous.com 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci
Re: opensuse 12.1
Since OpenSUSE is an RPM-based distribution, you can try the Apache Bigtop project [1]: look for the RPM packages and give them a try. Note that the RPM packaging conventions differ a little between OpenSUSE and Red Hat-based distributions, but it can be a starting point. See the documentation for the project [2]. [1] http://incubator.apache.org/projects/bigtop.html [2] https://cwiki.apache.org/confluence/display/BIGTOP/Index%3bjsessionid=AA31645DFDAE1F3282D0159DB9B6AE9A Regards On 04/04/2012 12:24 PM, Raj Vishwanathan wrote: Lots of people seem to start with this. http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ Raj From: Barry, Sean F sean.f.ba...@intel.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org Sent: Wednesday, April 4, 2012 9:12 AM Subject: FW: opensuse 12.1 -Original Message- From: Barry, Sean F [mailto:sean.f.ba...@intel.com] Sent: Wednesday, April 04, 2012 9:10 AM To: common-user@hadoop.apache.org Subject: opensuse 12.1 What is the best way to install hadoop on opensuse 12.1 for a small two node cluster? -SB 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci -- Marcos Luis Ortíz Valmaseda (@marcosluis2186) Data Engineer at UCI http://marcosluis2186.posterous.com 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci
Re: Yahoo Hadoop Tutorial with new APIs?
Ok, Robert, I will be waiting for you then. There are many folks that use this tutorial, so I think this a good effort in favor of the Hadoop community.It would be nice if Yahoo! donate this work, because, I have some ideas behind this, for example: to release a Spanish version of the tutorial. Regards and best wishes On 04/04/2012 05:29 PM, Robert Evans wrote: I am dropping the cross posts and leaving this on common-user with the others BCCed. Marcos, That is a great idea to be able to update the tutorial, especially if the community is interested in helping to do so. We are looking into the best way to do this. The idea right now is to donate this to the Hadoop project so that the community can keep it up to date, but we need some time to jump through all of the corporate hoops to get this to happen. We have a lot going on right now, so if you don't see any progress on this please feel free to ping me and bug me about it. -- Bobby Evans On 4/4/12 8:15 AM, Jagat Singh jagatsi...@gmail.com wrote: Hello Marcos Yes , Yahoo tutorials are pretty old but still they explain the concepts of Map Reduce , HDFS beautifully. The way in which tutorials have been defined into sub sections , each builing on previous one is awesome. I remember when i started i was digged in there for many days. The tutorials are lagging now from new API point of view. Lets have some documentation session one day , I would love to Volunteer to update those tutorials if people at Yahoo take input from outside world :) Regards, Jagat - Original Message - From: Marcos Ortiz Sent: 04/04/12 08:32 AM To: common-user@hadoop.apache.org, 'hdfs-u...@hadoop.apache.org %27hdfs-u...@hadoop.apache.org', mapreduce-u...@hadoop.apache.org Subject: Yahoo Hadoop Tutorial with new APIs? Regards to all the list. There are many people that use the Hadoop Tutorial released by Yahoo at http://developer.yahoo.com/hadoop/tutorial/ http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining The main issue here is that, this tutorial is written with the old APIs? (Hadoop 0.18 I think). Is there a project for update this tutorial to the new APIs? to Hadoop 1.0.2 or YARN (Hadoop 0.23) Best wishes -- Marcos Luis Ortíz Valmaseda (@marcosluis2186) Data Engineer at UCI http://marcosluis2186.posterous.com http://www.uci.cu/ http://www.uci.cu/ -- Marcos Luis Ortíz Valmaseda (@marcosluis2186) Data Engineer at UCI http://marcosluis2186.posterous.com 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci
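For readers wondering what "the new API" means in this thread: the Yahoo tutorial's examples are written against org.apache.hadoop.mapred (the 0.18-era interfaces), while current code uses org.apache.hadoop.mapreduce. A minimal word-count sketch in the new API, just to show the shape an updated tutorial example would take:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class NewApiWordCount {
      public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx) throws IOException, InterruptedException {
          StringTokenizer it = new StringTokenizer(value.toString());
          while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            ctx.write(word, ONE);
          }
        }
      }

      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          ctx.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "wordcount");
        job.setJarByClass(NewApiWordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }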
Re: How can I build a Collaborative Filtering recommendation framework based on mapreduce
Mahout was built precisely for this, so I think you should evaluate it again. It has two kinds of collaborative filtering recommenders: - Non-distributed recommenders (Taste) https://cwiki.apache.org/confluence/display/MAHOUT/Recommender+Documentation - Distributed recommenders (item-based) https://cwiki.apache.org/confluence/display/MAHOUT/Itembased+Collaborative+Filtering - First-timer FAQ https://cwiki.apache.org/confluence/display/MAHOUT/Recommender+First-Timer+FAQ About the test that you did with Mahout: - What are the specs of your machine? If you are working with 175M of data, a single machine is not the best way to do it. It is more worthwhile to use a small Hadoop cluster for this (1 NN/JT and 3 DN/TT), and then you can ask on the Mahout mailing list how to improve the performance of your setup. Regards On 3/31/2012 6:17 AM, chao yin wrote: Hi all: I'm new to MapReduce, but familiar with collaborative filtering recommendation frameworks. I tried to use Mahout to do this work, but it disappointed me: my machine worked all day on this job without any result, with about 175M of data. Does anyone know anything about a collaborative filtering recommendation framework based on MapReduce, or Mahout? Any suggestions to improve performance? -- Best regards, Yin -- Marcos Luis Ortíz Valmaseda (@marcosluis2186) Data Engineer at UCI http://marcosluis2186.posterous.com 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci
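As a concrete starting point for the non-distributed (Taste) route mentioned above, the sketch below builds a small user-based recommender from a ratings file with one userID,itemID,rating triple per line; the file name, the neighborhood size of 25 and the user ID are placeholders:

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class TasteExample {
      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv"));   // userID,itemID,rating per line
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(25, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> recs = recommender.recommend(42L, 5);     // top 5 items for user 42
        for (RecommendedItem item : recs) {
          System.out.println(item.getItemID() + " " + item.getValue());
        }
      }
    }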
Re: Job tracker service start issue.
On 03/23/2012 06:57 AM, kasi subrahmanyam wrote: Hi Oliver, I am not sure my suggestion might solve your problem or it might be already solved on your side. It seems the task tracker is having a problem accessing the tmp directory. Try going to the core and mapred site xml and change the tmp directory to a new one. If this is not yet working then manually change the permissions of theat directory using : chmod -R 777 tmp Please, don´t do chmod -R 777 in tmp directory. It´s not recommendable for production servers. The first option is more wise: 1- change the tmp directory in the core and mapreduce files 2- chown this new directory to group hadoop, where are the mapred and hdfs users On Fri, Mar 23, 2012 at 3:33 PM, Olivier Sallouolivier.sal...@irisa.frwrote: Le 3/23/12 8:50 AM, Manish Bhoge a écrit : I have Hadoop running on Standalone box. When I am starting deamon for namenode, secondarynamenode, job tracker, task tracker and data node, it is starting gracefully. But soon after it start job tracker it doesn't show up job tracker service. when i run 'jps' it is showing me all the services including task tracker except Job Tracker. Is there any time limit that need to set up or is it going into the safe mode. Because when i saw job tracker log this what it is showing, looks like it is starting the namenode but soon after it shutdown: 2012-03-22 23:26:04,061 INFO org.apache.hadoop.mapred.JobTracker: STARTUP_MSG: / STARTUP_MSG: Starting JobTracker STARTUP_MSG: host = manish/10.131.18.119 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.2-cdh3u3 STARTUP_MSG: build = file:///data/1/tmp/nightly_2012-02-16_09-46-24_3/hadoop-0.20-0.20.2+923.195-1~maverick -r 217a3767c48ad11d4632e19a22897677268c40c4; compiled by 'root' on Thu Feb 16 10:22:53 PST 2012 / 2012-03-22 23:26:04,140 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens 2012-03-22 23:26:04,141 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Starting expired delegation token remover thread, tokenRemoverScanInterval=60 min(s) 2012-03-22 23:26:04,141 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens 2012-03-22 23:26:04,142 INFO org.apache.hadoop.mapred.JobTracker: Scheduler configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT, limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1) 2012-03-22 23:26:04,143 INFO org.apache.hadoop.util.HostsFileReader: Refreshing hosts (include/exclude) list 2012-03-22 23:26:04,186 INFO org.apache.hadoop.mapred.JobTracker: Starting jobtracker with owner as mapred 2012-03-22 23:26:04,201 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 54311 2012-03-22 23:26:04,203 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=JobTracker, port=54311 2012-03-22 23:26:04,206 INFO org.apache.hadoop.ipc.metrics.RpcDetailedMetrics: Initializing RPC Metrics with hostName=JobTracker, port=54311 2012-03-22 23:26:09,250 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog 2012-03-22 23:26:09,298 INFO org.apache.hadoop.http.HttpServer: Added global filtersafety (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter) 2012-03-22 23:26:09,318 INFO org.apache.hadoop.http.HttpServer: Port returned by webServer.getConnectors()[0].getLocalPort() before open() is 
-1. Opening the listener on 50030 2012-03-22 23:26:09,318 INFO org.apache.hadoop.http.HttpServer: listener.getLocalPort() returned 50030 webServer.getConnectors()[0].getLocalPort() returned 50030 2012-03-22 23:26:09,318 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 50030 2012-03-22 23:26:09,319 INFO org.mortbay.log: jetty-6.1.26.cloudera.1 2012-03-22 23:26:09,517 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:50030 2012-03-22 23:26:09,519 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 2012-03-22 23:26:09,519 INFO org.apache.hadoop.mapred.JobTracker: JobTracker up at: 54311 2012-03-22 23:26:09,519 INFO org.apache.hadoop.mapred.JobTracker: JobTracker webserver: 50030 2012-03-22 23:26:09,648 WARN org.apache.hadoop.mapred.JobTracker: Failed to operate on mapred.system.dir (hdfs://localhost:54310/app/hadoop/tmp/mapred/system) because of permissions. 2012-03-22 23:26:09,648 WARN org.apache.hadoop.mapred.JobTracker: This directory should be owned by the user 'mapred (auth:SIMPLE)' 2012-03-22 23:26:09,650 WARN org.apache.hadoop.mapred.JobTracker: Bailing out ... org.apache.hadoop.security.AccessControlException: The systemdir
Apache Hadoop works with IPv6?
Regards. I'm very interested to know if Apache Hadoop works with IPv6 hosts. One of my clients has some hosts with this feature and they want to know if Hadoop supports this. Anyone has tested this? Best wishes -- Marcos Luis Ortíz Valmaseda (@marcosluis2186) Data Engineer at UCI http://marcosluis2186.posterous.com 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci
Re: Reduce copy speed too slow
Hi, Gayatri On 03/20/2012 11:59 AM, Gayatri Rao wrote: Hi all, I am running a map reduce job on EC2 instances and it seems to be very slow. It takes hours for simple projection and aggregation of data. What filesystem are you using for data storage: HDFS in EC2 or Amazon S3? And what is the size of the data you are analyzing? Upon observation, I gathered that the reduce copy speed is 0.01 MB/sec. I am new to hadoop. Could anyone please share insights about what reduce copy speeds are good to work with? If anyone has experience, any tips on improving it? Hadoop Map/Reduce jobs shuffle lots of data, so the recommended configuration is to use 10Gbps networks for the underlying connection (and dedicated switches on dual-gigabit networks). Remember too that Hadoop is not a real-time system; if you need real-time random access to your data, use HBase http://hbase.apache.org Regards Thanks Gayatri 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci -- Marcos Luis Ortíz Valmaseda (@marcosluis2186) Data Engineer at UCI http://marcosluis2186.posterous.com 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci
Re: Retry question
HDFS is built precisely with these concerns in mind. If you read a 60 GB file and a rack goes down, the system will transparently serve you another copy, based on your replication factor. A block can also become unavailable due to corruption; in that case it can be re-replicated to other live machines, and you can detect and fix such errors with the fsck utility. Regards On 3/18/2012 9:46 AM, Rita wrote: My replication factor is 3 and if I were reading data through libhdfs using C, is there a retry method? I am reading a 60gb file and what would happen if a rack goes down and the next block isn't available? Will the API retry? Is there a way to configure this option? -- --- Get your facts first, then you can distort them as you please.-- -- Marcos Luis Ortíz Valmaseda (@marcosluis2186) Data Engineer at UCI http://marcosluis2186.posterous.com 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci
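To illustrate the point for the Java FileSystem API (libhdfs wraps the same client): a plain read loop needs no retry logic of its own, because the DFS client switches to another replica if the datanode serving the current block becomes unavailable. A minimal sketch, with the path taken from the command line:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicaAwareRead {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataInputStream in = fs.open(new Path(args[0]));
        byte[] buffer = new byte[64 * 1024];
        long total = 0;
        int n;
        // If a datanode holding the current block dies, the client
        // moves to another replica; the loop itself needs no retry logic.
        while ((n = in.read(buffer)) != -1) {
          total += n;
        }
        in.close();
        System.out.println("Read " + total + " bytes");
      }
    }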
Re: Best practice to setup Sqoop,Pig and Hive for a hadoop cluster ?
On 03/15/2012 09:22 AM, Manu S wrote: Thanks a lot Bijoy, that makes sense :) Suppose if I have Mysql database in some other node(not in hadoop cluster), can I import the tables using sqoop to my HDFS? Yes, this is the main purpose of Sqoop On the Cloudera site, you have the completed documentation for it Sqoop User Guide http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html Sqoop installation https://ccp.cloudera.com/display/CDHDOC/Sqoop+Installation Sqoop for MySQL http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_mysql Sqoop site on GitHub http://github.com/cloudera/sqoop Cloudera blog related post to Sqoop http://www.cloudera.com/blog/category/sqoop/ Best wishes On Thu, Mar 15, 2012 at 6:27 PM, Bejoy Ks bejoy.had...@gmail.com mailto:bejoy.had...@gmail.com wrote: Hi Manu Please find my responses inline I had read about we can install Pig, hive Sqoop on the client node, no need to install it in cluster. What is the client node actually? Can I use my management-node as a client? On larger clusters we have different node that is out of hadoop cluster and these stay in there. So user programs would be triggered from this node. This is the node refereed to as client node/ edge node etc . For your cluster management node and client node can be the same What is the best practice to install Pig, Hive, Sqoop? On a client node For the fully distributed cluster do we need to install Pig, Hive, Sqoop in each nodes? No, can be on a client node or on any of the nodes Mysql is needed for Hive as a metastore and sqoop can import mysql database to HDFS or hive or pig, so can we make use of mysql DB's residing on another node? Regarding your first point, SQOOP import is for different purpose, to get data from RDBNS into hdfs. But the meta stores is used by hive in framing the map reduce jobs corresponding to your hive query. Here SQOOP can't help you much Recommend to have the metastore db of hive on the same node where hive is installed as for execution hive queries there is meta data look up required much especially when your table has large number of partitions and all. Regards Bejoy.K.S On Thu, Mar 15, 2012 at 5:34 PM, Manu S manupk...@gmail.com mailto:manupk...@gmail.com wrote: Greetings All !!! I am using Cloudera CDH3 for Hadoop deployment. We have 7 nodes, in which 5 are used for a fully distributed cluster, 1 for pseudo-distributed 1 as management-node. Fully distributed cluster: HDFS, Mapreduce Hbase cluster Pseudo distributed mode: All I had read about we can install Pig, hive Sqoop on the client node, no need to install it in cluster. What is the client node actually? Can I use my management-node as a client? What is the best practice to install Pig, Hive, Sqoop? For the fully distributed cluster do we need to install Pig, Hive, Sqoop in each nodes? Mysql is needed for Hive as a metastore and sqoop can import mysql database to HDFS or hive or pig, so can we make use of mysql DB's residing on another node? -- Thanks Regards Manu S SI Engineer - OpenSource HPC Wipro Infotech Mob: +91 8861302855Skype: manuspkd www.opensourcetalk.co.in http://www.opensourcetalk.co.in -- Thanks Regards Manu S SI Engineer - OpenSource HPC Wipro Infotech Mob: +91 8861302855Skype: manuspkd www.opensourcetalk.co.in http://www.opensourcetalk.co.in -- Marcos Luis Ortíz Valmaseda Sr. Software Engineer (UCI) http://marcosluis2186.posterous.com http://postgresql.uci.cu/blog/38 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... 
CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci
Re: Error while using libhdfs C API
On 03/09/2012 07:34 AM, Amritanshu Shekhar wrote: Hi Marcos, Figured out the compilation issue. It was due to error.h header file which was not used and not present in the distribution. There is one small issue however I was trying to test hdfs read. I copied an input file to /user/inputData(this can be listed using bin/hadoop dfs -ls /user/inputData). hdfsExists call fails for this directory however it works when I copy my file to /tmp. Is it because hdfs only recognizes /tmp as a valid dir? Thus I was wondering what directory structure does hdfs recognize by default and if we can override it through a conf variable what would that variable be and where to set it? Thanks, Amritanshu Awesome, Amritanshu. CC to hdfs-user@hadoop.apache.org Please, give some logs about your work with the compilation. How did you solve this? To have it on the mailing list archives. About your another issue, 1- Did you check that the $HADOOP_USER has access to /user/inputData? HDFS: It recognize the directory that you entered on the hdfs-site.xml on the dfs.name.dir(NN) property and on the dfs.data.dir (DN), but by default, it works with /tmp directory (not recommended in production). Look on the Eugene Ciuranas Refcard called "Deploying Hadoop", where he did a amazing work explaining in a few pages some tricky configurations tips. Regards From: Marcos Ortiz [mailto:mlor...@uci.cu] Sent: Wednesday, March 07, 2012 7:36 PM To: Amritanshu Shekhar Subject: Re: Error while using libhdfs C API On 03/07/2012 01:15 AM, Amritanshu Shekhar wrote: Hi Marcos, Thanks for the quick reply. Actually I am using a gmake build system where the library is being linked as a static library(.a ) rather than a shared object. It seems strange since stderr is a standard symbol which should be resolved. Currently I am using the version that came with the distribution($HOME/c++/Linux-amd64-64/lib/libhdfs.a) . I tried building the library from the source but there were build dependencies that could not be resolved. I tried building $HOME/hadoop/hdfs/src/c++/libhdfs by running: ./configure ./make I got a lot of dependency errors so gave up the effort. If you happen to have a working application that make suse of libhdfs please let me know. Any inputs would be welcome as I have hit a roadblock as far as libhdfs is concerned. Thanks, Amritanshu No, Amritansu. I don't have any examples of the use of libhdfs API, but I remembered that some folks were using it. Search on the mailing list archives (http://www.search-hadoop.com). Can you put the errors that you had in your system when you tried to compile the library? Regards and best wishes From: Marcos Ortiz [mailto:mlor...@uci.cu] Sent: Monday, March 05, 2012 6:51 PM To: hdfs-user@hadoop.apache.org Cc: Amritanshu Shekhar Subject: Re: Error while using libhdfs C API Which platform are you using? Did you update the dynamic linker runtime bindings (ldconfig)? ldconfig $HOME/hadoop/c++/Linux-amd64/lib Regards On 03/06/2012 02:38 AM, Amritanshu Shekhar wrote: Hi, I was trying to link 64 bit libhdfs in my application program but it seems there is an issue with this library. Get the following error: Undefined first referenced symbol in file stderr libhdfs.a(hdfs.o) __errno_location libhdfs.a(hdfs.o) ld: fatal: Symbol referencing errors. No output written to ../../bin/sun86/mapreduce collect2: ld returned 1 exit status Now I was wondering if this a common error and is there an actual issue with the library or am I getting an error because of an incorrect configuration? 
I am using the following library: $HOME/hadoop/c++/Linux-amd64-64/lib/libhdfs.a Thanks, Amritanshu -- Marcos Luis Ortz
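libhdfs is a thin wrapper over the same client that the Java FileSystem API uses, so one quick sanity check for the /user/inputData problem is to confirm from Java which filesystem the client is actually pointed at; if the Hadoop configuration is not visible to the C program, the connection may end up on the local filesystem (where /tmp exists but /user/inputData does not). A sketch, with the path taken from the thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckPath {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Default filesystem: " + fs.getUri());
    Path p = new Path("/user/inputData");     // path from the thread
    if (fs.exists(p)) {
      FileStatus st = fs.getFileStatus(p);
      System.out.println(p + " owner=" + st.getOwner() + " permissions=" + st.getPermission());
    } else {
      System.out.println(p + " not found in " + fs.getUri());
    }
  }
}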
Re: Hadoop 0.23.1 installation
On 03/01/2012 04:48 AM, raghavendhra rahul wrote: Hi, I tried to configure hadoop 0.23.1.I added all libs from share folder to lib directory.But still i get the error while formating the namenode Exception in thread main java.lang.NoClassDefFoundError: org/apache/hadoop/hdfs/server/namenode/NameNode Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.server.namenode.NameNode at java.net.URLClassLoader$1.run(URLClassLoader.java:217) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:205) at java.lang.ClassLoader.loadClass(ClassLoader.java:321) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294) at java.lang.ClassLoader.loadClass(ClassLoader.java:266) Could not find the main class: org.apache.hadoop.hdfs.server.namenode.NameNode. Program will exit. Any help??? Can you show us here your .conf files? core-site.xml mapred-site.xml hdfs-site.xml Which is your configuration for your conf/hadoop-env.sh? Regards -- Marcos Luis Ortíz Valmaseda Sr. Software Engineer (UCI) http://marcosluis2186.posterous.com http://postgresql.uci.cu/blog/38 Fin a la injusticia, LIBERTAD AHORA A NUESTROS CINCO COMPATRIOTAS QUE SE ENCUENTRAN INJUSTAMENTE EN PRISIONES DE LOS EEUU! http://www.antiterroristas.cu http://justiciaparaloscinco.wordpress.com
Re: Query Regarding design MR job for Billing
On 02/27/2012 11:33 PM, Stuti Awasthi wrote: Hi Marcos, Thanks for the pointers. I am also thinking on the similar lines. I am doubtful at 1 point : I will be having separate data files for every interval. Let's take example if I have 5 mins interval file which contain data for 2 hours and 10 mins. In this scenario I want to process 2 hours data with hours job and 10 mins data with mins job. Now since I will provide my data file as Input to MR jobs so I think original file needs to split in 2 files : HourFile and MinsFile. HourFile wll contain data for 2 hours and MinsFile will conatin data for 10 mins. Well, you can with Oozie(http://yahoo.github.com/oozie/) or Cascading(http://cascading.org) for complex workflow programming. 1- For example, you can write a MapReduce job for spit your data: one by hour, and one by mins. In your case: a simple output would be one data file containing your data for 2 hours, and another data file for your 10 mins. I think that this job could be Mapper-only type with the MultipleOutputFormat. 2- Then you can write the different jobs for each interval (HourIntervalJob, MonthIntervalJob, etc), spliting its outputs depending of each interval in HDFS. You can define your complete workflow, and then, you can evaluate Oozie or Cascading to control that workflow. Regards Remember that all thes are suggestions. I'm not a MR expert I have attained file splitting with simple Java class but I think there is too much I/O operations and if I can attain this also in MR or in some efficient way, it will be good because the original data files can be huge and then the initial breaking of files will itself take too much time. Please suggest. Thanks -Original Message- From: Marcos Ortiz [mailto:mlor...@uci.cu] Sent: Sunday, February 26, 2012 7:40 PM To: mapreduce-user@hadoop.apache.org Cc: Stuti Awasthi Subject: Re: Query Regarding design MR job for Billing Well, first, you can design 6 MR jobs: 1- for 5 mins interval 2- for 1 hour 3- for 1 day 4- for 1 month 5- for 1 year 6- and a last for any interval If you say that for each interval, you have to do a different calculation; this way could be a solution (at least I think that). You can read the design patterns for MapReduce algorithms proposed by Jimmy Lin and Chris Dyer on his Data-Intensive Text Processing with MapReduce book. Regards On 02/27/2012 05:39 AM, Stuti Awasthi wrote: No. The data will be either of 5 mins interval, or 1 hour interval or 1 day interval and so on So suppose utilization is for 40 days then I will charge 30 days according to months billing and remaining 10 days as days billing job. -Original Message- From: Rohit Kelkar [mailto:rohitkel...@gmail.com] Sent: Monday, February 27, 2012 4:06 PM To: mapreduce-user@hadoop.apache.org Subject: Re: Query Regarding design MR job for Billing Just trying to understand your use case you need an hour job to run on data between 6:40 AM and 7:40 AM. Would it be like a moving window? For ex. run hour jobs on 6:41 AM to 7:41 AM 6:42 AM to 7:42 AM and so on... On Mon, Feb 27, 2012 at 1:01 PM, Stuti Awasthistutiawas...@hcl.com wrote: Hi all, I have to implement BillingEngine using MR jobs. My usecase is like this: I will be having data files of formatTimeStamp Information for Billing. Now these datafiles will be containing timestamp either at minute interval, hour inverval, day interval, month interval, year interval. Every type of interval will be having different type of calculation for billing so basically different jobs for every type of interval. 
Suppose I have a data file which contain minute interval timestamp. I have a scenario that if data is present for hours , then it should be processed by hourly job and remaining will be processed by minutejob. Example : 2/10/12 6:40 AMdata for billing 2/10/12 6:40 AMdata for billing . 2/10/12 6:45 AMdata for billing 2/10/12 6:45 AMdata for billing . . 2/10/12 7:40 AMdata for billing 2/10/12 7:40 AMdata for billing . . 2/10/12 7:45 AMdata for billing 2/10/12 7:45 AMdata for billing . Now I want data between 2/10/12 6:40 AM to 2/10/12 7:40 AM is processed by Hourjob and 2/10/12 7:45 AM is processed by MinuteJob. Please suggest how to design my MR to achieve this. Thanks Stuti ::DISCLAIMER:: - - - The contents of this e-mail and any attachment(s) are confidential and intended for the named recipient(s) only. It shall not attach any liability on the originator or HCL or its affiliates. Any views or opinions presented in this email are solely those of the author and may not necessarily reflect the opinions of HCL or its affiliates. Any form of reproduction, dissemination, copying, disclosure, modification, distribution and / or publication of this message without the prior written consent of the author of this e-mail is strictly prohibited
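As a rough illustration of the map-only split step suggested above (a sketch only; the class name, the record layout, and the predicate that decides whether a record belongs to a complete hour are all hypothetical), MultipleOutputs from the new API can route each record to an "hour" or a "minute" file:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class IntervalSplitMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
  private MultipleOutputs<NullWritable, Text> mos;

  @Override
  protected void setup(Context ctx) {
    mos = new MultipleOutputs<NullWritable, Text>(ctx);
  }

  @Override
  protected void map(LongWritable offset, Text record, Context ctx)
      throws IOException, InterruptedException {
    // Placeholder predicate: the real check is whether the record falls inside a complete hour window.
    boolean belongsToHourJob = isInsideCompleteHour(record.toString());
    mos.write(belongsToHourJob ? "hour" : "minute", NullWritable.get(), record);
  }

  private boolean isInsideCompleteHour(String record) {
    return record.contains(" 6:"); // dummy logic, replace with real timestamp parsing
  }

  @Override
  protected void cleanup(Context ctx) throws IOException, InterruptedException {
    mos.close(); // without this, the named output files are not finalized
  }
}

The driver would register both named outputs with MultipleOutputs.addNamedOutput(job, "hour", ...) and MultipleOutputs.addNamedOutput(job, "minute", ...), set the number of reducers to zero, and the two resulting directories would then feed the hour-level and minute-level billing jobs.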
Re: Query Regarding design MR job for Billing
Well, first, you can design 6 MR jobs: 1- for 5 mins interval 2- for 1 hour 3- for 1 day 4- for 1 month 5- for 1 year 6- and a last for any interval If you say that for each interval, you have to do a different calculation; this way could be a solution (at least I think that). You can read the design patterns for MapReduce algorithms proposed by Jimmy Lin and Chris Dyer on his Data-Intensive Text Processing with MapReduce book. Regards On 02/27/2012 05:39 AM, Stuti Awasthi wrote: No. The data will be either of 5 mins interval, or 1 hour interval or 1 day interval and so on So suppose utilization is for 40 days then I will charge 30 days according to months billing and remaining 10 days as days billing job. -Original Message- From: Rohit Kelkar [mailto:rohitkel...@gmail.com] Sent: Monday, February 27, 2012 4:06 PM To: mapreduce-user@hadoop.apache.org Subject: Re: Query Regarding design MR job for Billing Just trying to understand your use case you need an hour job to run on data between 6:40 AM and 7:40 AM. Would it be like a moving window? For ex. run hour jobs on 6:41 AM to 7:41 AM 6:42 AM to 7:42 AM and so on... On Mon, Feb 27, 2012 at 1:01 PM, Stuti Awasthistutiawas...@hcl.com wrote: Hi all, I have to implement BillingEngine using MR jobs. My usecase is like this: I will be having data files of formatTimeStamp Information for Billing. Now these datafiles will be containing timestamp either at minute interval, hour inverval, day interval, month interval, year interval. Every type of interval will be having different type of calculation for billing so basically different jobs for every type of interval. Suppose I have a data file which contain minute interval timestamp. I have a scenario that if data is present for hours , then it should be processed by hourly job and remaining will be processed by minutejob. Example : 2/10/12 6:40 AMdata for billing 2/10/12 6:40 AMdata for billing . 2/10/12 6:45 AMdata for billing 2/10/12 6:45 AMdata for billing . . 2/10/12 7:40 AMdata for billing 2/10/12 7:40 AMdata for billing . . 2/10/12 7:45 AMdata for billing 2/10/12 7:45 AMdata for billing . Now I want data between 2/10/12 6:40 AM to 2/10/12 7:40 AM is processed by Hourjob and 2/10/12 7:45 AM is processed by MinuteJob. Please suggest how to design my MR to achieve this. Thanks Stuti ::DISCLAIMER:: -- - The contents of this e-mail and any attachment(s) are confidential and intended for the named recipient(s) only. It shall not attach any liability on the originator or HCL or its affiliates. Any views or opinions presented in this email are solely those of the author and may not necessarily reflect the opinions of HCL or its affiliates. Any form of reproduction, dissemination, copying, disclosure, modification, distribution and / or publication of this message without the prior written consent of the author of this e-mail is strictly prohibited. If you have received this email in error please delete it and notify the sender immediately. Before opening any mail and attachments please check them for viruses and defect. -- - -- Marcos Luis Ortíz Valmaseda Senior Software Engineer (UCI) http://marcosluis2186.posterous.com http://www.linkedin.com/in/marcosluis2186 Twitter: @marcosluis2186 Fin a la injusticia, LIBERTAD AHORA A NUESTROS CINCO COMPATRIOTAS QUE SE ENCUENTRAN INJUSTAMENTE EN PRISIONES DE LOS EEUU! http://www.antiterroristas.cu http://justiciaparaloscinco.wordpress.com
Re: MapReduce jobs hanging or failing near completion
El 7/7/2011 8:43 PM, Kai Ju Liu escribió: Over the past week or two, I've run into an issue where MapReduce jobs hang or fail near completion. The percent completion of both map and reduce tasks is often reported as 100%, but the actual number of completed tasks is less than the total number. It appears that either tasks backtrack and need to be restarted or the last few reduce tasks hang interminably on the copy step. In certain cases, the jobs actually complete. In other cases, I can't wait long enough and have to kill the job manually. My Hadoop cluster is hosted in EC2 on instances of type c1.xlarge with 4 attached EBS volumes. The instances run Ubuntu 10.04.1 with the 2.6.32-309-ec2 kernel, and I'm currently using Cloudera's CDH3u0 distribution. Has anyone experienced similar behavior in their clusters, and if so, had any luck resolving it? Thanks! Can you post here your NN and DN logs files? Regards Kai Ju -- Marcos Luís Ortíz Valmaseda Software Engineer (UCI) Linux User # 418229 http://marcosluis2186.posterous.com http://twitter.com/marcosluis2186
Re: AW: How to split a big file in HDFS by size
Evert Lammerts at Sara.nl did something similar to your problem, splitting a big 2.7 TB file into chunks of 10 GB. This work was presented at the BioAssist Programmers' Day in January of this year under the name Large-Scale Data Storage and Processing for Scientist in The Netherlands http://www.slideshare.net/evertlammerts P.S.: I sent the message with a copy to him El 6/20/2011 10:38 AM, Niels Basjes escribió: Hi, On Mon, Jun 20, 2011 at 16:13, Mapred Learnmapred.le...@gmail.com wrote: But this file is a gzipped text file. In this case, it will only go to 1 mapper, unlike the case where it is split into 60 1 GB files, which will make the map-red job finish earlier than one 60 GB file as it will have 60 mappers running in parallel. Isn't it so? Yes, that is very true. -- Marcos Luís Ortíz Valmaseda Software Engineer (UCI) http://marcosluis2186.posterous.com http://twitter.com/marcosluis2186
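Since a gzipped text file cannot be split, one common workaround (a sketch, not something described in the thread; paths come from the command line) is a pass-through MapReduce job that rewrites the single .gz file into N plain-text parts, after which N mappers can work in parallel. Note that the shuffle reorders the lines, which is usually acceptable for record-oriented data:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Repartition {
  // Pass-through mapper: one gzipped input -> N plain-text parts, one per reducer.
  public static class PassThrough extends Mapper<LongWritable, Text, Text, NullWritable> {
    protected void map(LongWritable off, Text line, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(line, NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "repartition");
    job.setJarByClass(Repartition.class);
    job.setMapperClass(PassThrough.class);     // the default (identity) reducer is kept
    job.setNumReduceTasks(60);                 // 60 output parts, each splittable afterwards
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // the single .gz file
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}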
Re: Query about hadoop dfs -cat in hadoop-0-0.20.2
On 06/17/2011 07:41 AM, Lemon Cheng wrote: Hi, I am using the hadoop-0.20.2. After calling ./start-all.sh, i can type hadoop dfs -ls. However, when i type hadoop dfs -cat /usr/lemon/wordcount/input/file01, the error is shown as follow. I have searched the related problem in the web, but i can't find a solution for helping me to solve this problem. Anyone can give suggestion? Many Thanks. 11/06/17 19:27:12 INFO hdfs.DFSClient: No node available for block: blk_7095683278339921538_1029 file=/usr/lemon/wordcount/input/file01 11/06/17 19:27:12 INFO hdfs.DFSClient: Could not obtain block blk_7095683278339921538_1029 from any node: java.io.IOException: No live nodes contain current block 11/06/17 19:27:15 INFO hdfs.DFSClient: No node available for block: blk_7095683278339921538_1029 file=/usr/lemon/wordcount/input/file01 11/06/17 19:27:15 INFO hdfs.DFSClient: Could not obtain block blk_7095683278339921538_1029 from any node: java.io.IOException: No live nodes contain current block 11/06/17 19:27:18 INFO hdfs.DFSClient: No node available for block: blk_7095683278339921538_1029 file=/usr/lemon/wordcount/input/file01 11/06/17 19:27:18 INFO hdfs.DFSClient: Could not obtain block blk_7095683278339921538_1029 from any node: java.io.IOException: No live nodes contain current block 11/06/17 19:27:21 WARN hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_7095683278339921538_1029 file=/usr/lemon/wordcount/input/file01 at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1812) at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1638) at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1767) at java.io.DataInputStream.read(DataInputStream.java:83) at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:47) at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85) at org.apache.hadoop.fs.FsShell.printToStdout(FsShell.java:114) at org.apache.hadoop.fs.FsShell.access$100(FsShell.java:49) at org.apache.hadoop.fs.FsShell$1.process(FsShell.java:352) at org.apache.hadoop.fs.FsShell$DelayedExceptionThrowing.globAndProcess(FsShell.java:1898) at org.apache.hadoop.fs.FsShell.cat http://org.apache.hadoop.fs.fsshell.cat/(FsShell.java:346) at org.apache.hadoop.fs.FsShell.doall(FsShell.java:1543) at org.apache.hadoop.fs.FsShell.run(FsShell.java:1761) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.fs.FsShell.main(FsShell.java:1880) Regards, Lemon Are you sure that all your DataNodes are online? -- Marcos Luís Ortíz Valmaseda Software Engineer (UCI) http://marcosluis2186.posterous.com http://twitter.com/marcosluis2186
Re: Query about hadoop dfs -cat in hadoop-0-0.20.2
On 06/17/2011 09:51 AM, Lemon Cheng wrote: Hi, Thanks for your reply. I am not sure that. How can I prove that? Which is your dfs.tmp.dir and dfs.data.dir values? You can check the DataNodes´s health with bin/slaves.sh jps | grep Datanode | sort Which is the output of bin/hadoop dfsadmin -report? One recomendation that I could say you is to have at least 1 NameNode and two Datanodes regards I checked the localhost:50070, it shows 1 live node and 0 dead node. And the log hadoop-appuser-datanode-localhost.localdomain.log shows: / 2011-06-17 19:59:38,658 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG: / STARTUP_MSG: Starting DataNode STARTUP_MSG: host = localhost.localdomain/127.0.0.1 http://127.0.0.1 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.2 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010 / 2011-06-17 19:59:46,738 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Registered FSDatasetStatusMBean 2011-06-17 19:59:46,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Opened info server at 50010 2011-06-17 19:59:46,752 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Balancing bandwith is 1048576 bytes/s 2011-06-17 19:59:46,812 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog 2011-06-17 19:59:46,870 INFO org.apache.hadoop.http.HttpServer: Port returned by webServer.getConnectors()[0].getLocalPort() before open() is -1. Opening the listener on 50075 2011-06-17 19:59:46,871 INFO org.apache.hadoop.http.HttpServer: listener.getLocalPort() returned 50075 webServer.getConnectors()[0].getLocalPort() returned 50075 2011-06-17 19:59:46,871 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 50075 2011-06-17 19:59:46,875 INFO org.mortbay.log: jetty-6.1.14 2011-06-17 20:01:45,702 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:50075 http://SelectChannelConnector@0.0.0.0:50075 2011-06-17 20:01:45,709 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=DataNode, sessionId=null 2011-06-17 20:01:45,743 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=DataNode, port=50020 2011-06-17 20:01:45,751 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: dnRegistration = DatanodeRegistration(localhost.localdomain:50010, storageID=DS-993704729-127.0.0.1-50010-1308296320968, infoPort=50075, ipcPort=50020) 2011-06-17 20:01:45,751 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 50020: starting 2011-06-17 20:01:45,753 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting 2011-06-17 20:01:45,754 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 50020: starting 2011-06-17 20:01:45,754 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50020: starting 2011-06-17 20:01:45,754 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50020: starting 2011-06-17 20:01:45,795 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010 http://127.0.0.1:50010, storageID=DS-993704729-127.0.0.1-50010-1308296320968, infoPort=50075, ipcPort=50020)In DataNode.run, data = FSDataset{dirpath='/tmp/hadoop-appuser/dfs/data/current'} 2011-06-17 20:01:45,799 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: using BLOCKREPORT_INTERVAL of 360msec Initial delay: 0msec 2011-06-17 20:01:45,828 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 0 blocks 
got processed in 11 msecs 2011-06-17 20:01:45,833 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Starting Periodic block scanner. 2011-06-17 20:56:02,945 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 0 blocks got processed in 1 msecs 2011-06-17 21:56:02,248 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 0 blocks got processed in 1 msecs On Fri, Jun 17, 2011 at 9:42 PM, Marcos Ortiz mlor...@uci.cu mailto:mlor...@uci.cu wrote: On 06/17/2011 07:41 AM, Lemon Cheng wrote: Hi, I am using the hadoop-0.20.2. After calling ./start-all.sh, i can type hadoop dfs -ls. However, when i type hadoop dfs -cat /usr/lemon/wordcount/input/file01, the error is shown as follow. I have searched the related problem in the web, but i can't find a solution for helping me to solve this problem. Anyone can give suggestion? Many Thanks. 11/06/17 19:27:12 INFO hdfs.DFSClient: No node available for block: blk_7095683278339921538_1029 file=/usr/lemon/wordcount/input/file01 11/06/17 19:27:12 INFO hdfs.DFSClient: Could not obtain block blk_7095683278339921538_1029 from any node
Re: can't compile the mapreduce project in eclipse
Did you add all dependencies of the source code? El 6/14/2011 10:32 AM, Erix Yao escribió: hi I checked out the source code from http://svn.apache.org/repos/asf/hadoop/mapreduce/tags/release-0.21.0 and executed ant compile eclipse-files, but after importing the project into eclipse, I found the error as below: Description Resource Path Location Type Project 'mapreduce' is missing required library: 'build/ivy/lib/Hadoop/common/avro-1.3.0.jar' mapreduce Build path Build Path Problem Here, the Avro jar is missing Description Resource Path Location Type Project 'mapreduce' is missing required source folder: 'src/contrib/sqoop/src/java' mapreduce Build path Build Path Problem And here, sqoop is missing. I don't know why this library is required for this, but it seems to be the problem. You should add all the required dependencies to your classpath variables under Window - Preferences - Java - Build Path - Classpath Variables Please go to the Cloudera Resources Site and search for the Eclipse/Hadoop screencast that explains quickly and easily how to build the Hadoop project. Regards -- Marcos Luís Ortíz Valmaseda Software Engineer (UCI) http://marcosluis2186.posterous.com http://twitter.com/marcosluis2186
Re: Programming Multiple rounds of mapreduce
Well, you can define a job for each round and then you can define the running workflow based on your implementation and chain your jobs El 6/13/2011 5:46 PM, Arko Provo Mukherjee escribió: Hello, I am trying to write a program where I need to write multiple rounds of map and reduce. The output of the last round of map-reduce must be fed into the input of the next round. Can anyone please guide me to any link / material that can teach me how I can achieve this. Thanks a lot in advance! Thanks regards Arko -- Marcos Luís Ortíz Valmaseda Software Engineer (UCI) http://marcosluis2186.posterous.com http://twitter.com/marcosluis2186
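A sketch of the simplest driver-side chaining, where each round's output directory becomes the next round's input (the naming scheme and the number of rounds are hypothetical; the per-round mapper and reducer classes would go where the comment indicates):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedRounds {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    int rounds = Integer.parseInt(args[1]);

    for (int i = 0; i < rounds; i++) {
      Job job = new Job(conf, "round-" + i);
      job.setJarByClass(ChainedRounds.class);
      // job.setMapperClass(...); job.setReducerClass(...);  // your per-round classes here
      Path output = new Path(args[0] + "-round" + i);        // hypothetical naming scheme
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, output);
      if (!job.waitForCompletion(true)) {
        System.exit(1);                                      // stop the chain if a round fails
      }
      input = output;                                        // feed this round's output into the next
    }
  }
}

For more involved dependency graphs, JobControl or an external workflow tool such as Oozie or Cascading does the same sequencing declaratively.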
Re: Input examples
You can use the HackReduce's datasets too for this. http://hackreduce.org/datasets Regards El 6/7/2011 1:56 PM, Jonathan Coveney escribió: Have you taken a look at the O'Reilly Hadoop book? It deals consistently with a weather dataset that is, I believe, largely available. 2011/6/7 Francesco De Luca f.deluc...@gmail.com mailto:f.deluc...@gmail.com Hello Sean, not exactely. I mean some applications like word count or inverted index and the relative input data. 2011/6/7 Sean Owen sro...@gmail.com mailto:sro...@gmail.com Not sure if it's quite what you mean, but, Apache Mahout is essentially all applications of Hadoop for machine learning, a bunch of runnable jobs (some with example data too). mahout.apache.org http://mahout.apache.org/ On Tue, Jun 7, 2011 at 3:54 PM, Francesco De Luca f.deluc...@gmail.com mailto:f.deluc...@gmail.com wrote: Where i can find some hadoop map reduce application examples (except word count) with associate input files? Thanks -- Marcos Luís Ortíz Valmaseda Software Engineer (UCI) http://marcosluis2186.posterous.com http://twitter.com/marcosluis2186
Re: Changing dfs.block.size
Another piece of advice here is that you can test the right block size in an environment similar to your production system before deploying the real system; that way you can avoid these kinds of changes. El 6/6/2011 3:09 PM, J. Ryan Earl escribió: Hello, So I have a question about changing dfs.block.size in $HADOOP_HOME/conf/hdfs-site.xml. I understand that when files are created, blocksizes can be modified from default. What happens if you modify the blocksize of an existing HDFS site? Do newly created files get the default blocksize and old files remain the same? Is there a way to change the blocksize of existing files; I'm assuming you could write a MapReduce job to do it, but are there any built-in facilities? Thanks, -JR -- Marcos Luís Ortíz Valmaseda Software Engineer (UCI) http://marcosluis2186.posterous.com http://twitter.com/marcosluis2186
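On the last question: newly created files simply get whatever block size the writer asks for, and existing files keep theirs until they are rewritten. A sketch of rewriting one file with a different block size through the FileSystem API (the 256 MB value and the paths are only illustrative); in bulk, people typically just copy the data with distcp so it is rewritten under the new default:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class RewriteBlockSize {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path src = new Path(args[0]);                       // existing file, old block size
    Path dst = new Path(args[1]);                       // copy that will carry the new block size
    long newBlockSize = 256L * 1024 * 1024;             // illustrative: 256 MB
    FSDataInputStream in = fs.open(src);
    FSDataOutputStream out = fs.create(dst, true, 4096, fs.getDefaultReplication(), newBlockSize);
    IOUtils.copyBytes(in, out, 4096, true);             // closes both streams when done
  }
}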
Re: cant remove files from tmp
How many DN you have? If this number is more than 1, check this in another DN to see if this happens too there. Check the /var/log/messages or dmesg (like Todd said you) with this for example: (this is one of my Ubuntu servers): less dmesg | grep EXT4-fs [1.583836] EXT4-fs (sda7): INFO: recovery required on readonly filesystem [1.583843] EXT4-fs (sda7): write access will be enabled during recovery [2.572935] EXT4-fs (sda7): orphan cleanup on readonly fs [2.620969] EXT4-fs (sda7): ext4_orphan_cleanup: deleting unreferenced inode 455946 [2.621015] EXT4-fs (sda7): ext4_orphan_cleanup: deleting unreferenced inode 455942 [2.621029] EXT4-fs (sda7): 2 orphan inodes deleted [2.621034] EXT4-fs (sda7): recovery complete [2.785283] EXT4-fs (sda7): mounted filesystem with ordered data mode. Opts: (null) [ 22.041130] EXT4-fs (sda7): re-mounted. Opts: errors=remount-ro [ 22.505474] EXT4-fs (sda8): mounted filesystem with ordered data mode. Opts: (null) Regards El 6/6/2011 4:43 PM, Todd Lipcon escribió: Hi Prem, My guess is that your Linux filesystem on this partition is corrupt. Check dmesg for output indicating fs-level errors. -Todd On Mon, Jun 6, 2011 at 1:23 PM, Jain, Prem premanshu.j...@netapp.com mailto:premanshu.j...@netapp.com wrote: Mapuser or hdfs user didn't seem to help, so I switched to root: [root@hadoop20 mapred]# ls -la /part/data total 0 drwx-- 3 hdfs hadoop 16 Jun 6 10:22 . drwxrwxrwx 4 hdfs hadoop 47 May 26 18:36 .. drwxr-xr-x 4 mapred mapred 35 May 26 21:02 tmp [root@hadoop20 mapred]# [root@hadoop20 mapred]# pwd /part/data/tmp/distcache/642114211252449475_2038269146_799583695/hmaster/user/mapred [root@hadoop20 mapred]# ls -la total 0 drwxr-xr-x 3 mapred mapred 22 Jun 6 12:46 . drwxr-xr-x 3 mapred mapred 19 May 26 21:17 .. ?- ? ? ? ?? input-dir -Original Message- From: Marcos Ortiz [mailto:mlor...@uci.cu mailto:mlor...@uci.cu] Sent: Monday, June 06, 2011 1:17 PM To: hdfs-user@hadoop.apache.org mailto:hdfs-user@hadoop.apache.org Cc: Jain, Prem Subject: Re: cant remove files from tmp * Why are using he root user for these operations? * Which are your permisions on your data directory? (ls -la /part/data)? Regards El 6/6/2011 3:41 PM, Jain, Prem escribió: I have a wrecked datanode which is giving me hard time restarting. It keeps complaining of Datanode dead, pid file exists. I already tried deleting the files but seems like the files are corrupted and don't allow me delete. Here is the log: / STARTUP_MSG: Starting DataNode STARTUP_MSG: host = hadoop20/192.168.1.190 http://192.168.1.190 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.2-cdh3u0 STARTUP_MSG: build = -r 81256ad0f2e4ab2bd34b04f53d25a6c23686dd14; compiled by 'root' on Fri Mar 25 20:07:24 EDT 2011 / 2011-06-06 09:11:01,232 INFO org.apache.hadoop.security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing. 2011-06-06 09:11:01,369 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: org.apache.hadoop.util.Shell$ExitCodeException: du: cannot access `/part/data/tmp/distcache/642114211252449475_2038269146_79 9583695/hmaster/user/mapred/input-dir': No such file or directory du: cannot read directory `/part/data/tmp/mapred/jobcache/job_201105261845_0005': Permission denied _ Here is the file I can't delete _ [root@hadoop20 distcache]# pwd /part/data/tmp/distcache [root@hadoop20 distcache]# ls -la total 0 drwxr-xr-x 3 mapred mapred 52 May 26 21:36 . drwxr-xr-x 4 mapred mapred 35 May 26 21:02 .. 
drwxr-xr-x 3 mapred mapred 20 May 26 21:17 642114211252449475_2038269146_799583695 [root@hadoop20 distcache]# cd * [root@hadoop20 642114211252449475_2038269146_799583695]# ls -la total 0 drwxr-xr-x 3 mapred mapred 20 May 26 21:17 . drwxr-xr-x 3 mapred mapred 52 May 26 21:36 .. drwxr-xr-x 3 mapred mapred 17 May 26 21:17 hmaster [root@hadoop20 642114211252449475_2038269146_799583695]# cd h* [root@hadoop20 hmaster]# ls user [root@hadoop20 hmaster]# cd * [root@hadoop20 user]# ls -la total 0 drwxr-xr-x 3 mapred mapred 19 May 26 21:17 . drwxr-xr-x 3 mapred mapred 17 May 26 21:17 .. drwxr-xr-x 3 mapred mapred 22 May 26 21:17 mapred [root@hadoop20 user]# cd m* [root@hadoop20 mapred]# ls -la
Re: question about using java in streaming mode
Why are using Java in streming mode instead use the native Mapper/Reducer code? Can you show to us the JobTracker's logs? Regards - Mensaje original - De: Siddhartha Jonnalagadda sid@gmail.com Para: mapreduce-user@hadoop.apache.org Enviados: Domingo, 5 de Junio 2011 7:16:08 GMT +01:00 Amsterdam / Berlín / Berna / Roma / Estocolmo / Viena Asunto: question about using java in streaming mode Hi, I was able use streaming in hadoop using python for the wordcount program, but created a Mapper and Reducer in Java since all my code is currently in Java. I first tried this: echo “foo foo quux labs foo bar quux” |java -cp ~/dummy.jar WCMapper | sort | java -cp ~/dummy.jar WCReducer It gave the correct output: labs 1 foo 3 bar 1 quux 2 Then, I installed a single-node cluster in hadoop and tried this: hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper “java -cp ~/dummy.jar WCMapper” -reducer “java -cp ~/dummy.jar WCReducer” -input gutenberg/* -output gutenberg-output -file dummy.jar (by tailoring the python command) This is the error: hadoop@siddhartha-laptop:/usr/local/hadoop$ hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper “java -cp ~/dummy.jar WCMapper” -reducer “java -cp ~/dummy.jar WCReducer” -input gutenberg/* -output gutenberg-output -file dummy.jar packageJobJar: [dummy.jar, /app/hadoop/tmp/hadoop-unjar5573454211442575176/] [] /tmp/streamjob6721719460213928092.jar tmpDir=null 11/06/04 20:47:15 INFO mapred.FileInputFormat: Total input paths to process : 3 11/06/04 20:47:15 INFO streaming.StreamJob: getLocalDirs(): [/app/hadoop/tmp/mapred/local] 11/06/04 20:47:15 INFO streaming.StreamJob: Running job: job_201106031901_0039 11/06/04 20:47:15 INFO streaming.StreamJob: To kill this job, run: 11/06/04 20:47:15 INFO streaming.StreamJob: /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201106031901_0039 11/06/04 20:47:15 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201106031901_0039 11/06/04 20:47:16 INFO streaming.StreamJob: map 0% reduce 0% 11/06/04 20:48:00 INFO streaming.StreamJob: map 100% reduce 100% 11/06/04 20:48:00 INFO streaming.StreamJob: To kill this job, run: 11/06/04 20:48:00 INFO streaming.StreamJob: /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_201106031901_0039 11/06/04 20:48:00 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201106031901_0039 11/06/04 20:48:00 ERROR streaming.StreamJob: Job not successful. Error: NA 11/06/04 20:48:00 INFO streaming.StreamJob: killJob… Streaming Job Failed! Any advice? Sincerely, Siddhartha Jonnalagadda, Text mining Researcher, Lnx Research, LLC, Orange, CA sjonnalagadda.wordpress.com Confidentiality Notice: This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message. -- Marcos Luís Ortíz Valmaseda Software Engineer (Large-Scaled Distributed Systems) http://marcosluis2186.posterous.com
Re: question about using java in streaming mode
El 6/5/2011 4:01 PM, Siddhartha Jonnalagadda escribió: Hi Marcos, I thought that streaming would make it easier because I was getting different errors with extending mapper and reducer in java. I tried: hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -file dummy.jar -mapper java -cp dummy.jar WCMapper -reducer java -cp dummy.jar WCReducer -input gutenberg/* -output gutenberg-output The error log in the map task: *_stderr logs_* Exception in thread main java.lang.NoClassDefFoundError: WCMapper Caused by: java.lang.ClassNotFoundException: WCMapper Which is the definition of your ClassPath? Because, this error is caused where the system can not find the definition of a class. at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:248) Could not find the main class: WCMapper. Program will exit. java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1 at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311) at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545) at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:121) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:435) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:371) at org.apache.hadoop.mapred.Child$4.run(Child.java:259) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.mapred.Child.main(Child.java:253) java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1 at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311) at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545) at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57) at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:435) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:371) at org.apache.hadoop.mapred.Child$4.run(Child.java:259) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.mapred.Child.main(Child.java:253) Sincerely, Siddhartha Jonnalagadda, sjonnalagadda.wordpress.com http://sjonnalagadda.wordpress.com Confidentiality Notice: This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message. On Sun, Jun 5, 2011 at 10:59 AM, Marcos Ortiz Valmaseda mlor...@uci.cu mailto:mlor...@uci.cu wrote: Why are using Java in streming mode instead use the native Mapper/Reducer code? 
Can you show to us the JobTracker's logs? Regards - Mensaje original - De: Siddhartha Jonnalagadda sid@gmail.com mailto:sid@gmail.com Para: mapreduce-user@hadoop.apache.org mailto:mapreduce-user@hadoop.apache.org Enviados: Domingo, 5 de Junio 2011 7:16:08 GMT +01:00 Amsterdam / Berlín / Berna / Roma / Estocolmo / Viena Asunto: question about using java in streaming mode Hi, I was able use streaming in hadoop using python for the wordcount program, but created a Mapper and Reducer in Java since all my code is currently in Java. I first tried this: echo “foo foo quux labs foo bar quux” |java -cp ~/dummy.jar WCMapper | sort | java -cp ~/dummy.jar WCReducer It gave the correct output: labs 1 foo 3 bar 1 quux 2 Then, I installed a single-node cluster in hadoop and tried this: hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper “java -cp ~/dummy.jar WCMapper” -reducer “java -cp ~/dummy.jar WCReducer
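For reference, the "native Mapper/Reducer" route mentioned above replaces the streaming invocation with an ordinary job driver. A sketch with hypothetical stand-ins for the thread's WCMapper/WCReducer classes (the input and output paths are the ones from the streaming command):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  // Hypothetical stand-ins for the WCMapper/WCReducer classes from the thread.
  public static class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    protected void map(LongWritable off, Text line, Context ctx)
        throws IOException, InterruptedException {
      for (String w : line.toString().split("\\s+")) {
        if (!w.isEmpty()) { word.set(w); ctx.write(word, ONE); }
      }
    }
  }
  public static class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : vals) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "wordcount");
    job.setJarByClass(WordCountDriver.class);  // ships the jar so the classes reach every task's classpath
    job.setMapperClass(WCMapper.class);
    job.setReducerClass(WCReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("gutenberg"));
    FileOutputFormat.setOutputPath(job, new Path("gutenberg-output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Submitted with hadoop jar dummy.jar WordCountDriver, this avoids the streaming classpath problem entirely, since setJarByClass takes care of distributing the classes to the tasks.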
Re: Unable to start hadoop-0.20.2 but able to start hadoop-0.20.203 cluster
On 05/31/2011 10:06 AM, Xu, Richard wrote: 1 namenode, 1 datanode. Dfs.replication=3. We also tried 0, 1, 2, same result. *From:*Yaozhen Pan [mailto:itzhak@gmail.com] *Sent:* Tuesday, May 31, 2011 10:34 AM *To:* hdfs-user@hadoop.apache.org *Subject:* Re: Unable to start hadoop-0.20.2 but able to start hadoop-0.20.203 cluster How many datanodes are in your cluster? And what is the value of dfs.replication in hdfs-site.xml (if not specified, the default value is 3)? From the error log, it seems there are not enough datanodes to replicate the files in hdfs. On 2011-5-31 22:23, Harsh J ha...@cloudera.com mailto:ha...@cloudera.com wrote: Xu, Please post the output of `hadoop dfsadmin -report` and attach the tail of a started DN's log? On Tue, May 31, 2011 at 7:44 PM, Xu, Richard richard...@citi.com mailto:richard...@citi.com wrote: 2. Also, Configured Cap... This might easily be the cause. I'm not sure if it's a Solaris thing that can lead to this though. 3. in datanode server, no error in logs, but tasktracker logs have the following suspicious thing:... I don't see any suspicious log message in what you'd posted. Anyhow, the TT does not matter here. -- Harsh J Regards, Xu When you installed on Solaris: - Did you synchronize the NTP server on all nodes? echo 'server youservernetp.com' >> /etc/inet/ntp.conf svcadm enable svc:/network/ntp:default - Are you using the same Java version on both systems (Ubuntu and Solaris)? - Can you test with one NN and two DN? -- Marcos Luis Ortiz Valmaseda Software Engineer (Distributed Systems) http://uncubanitolinuxero.blogspot.com
Re: MultipleOutputs Files remain in temporary folder
On 05/30/2011 11:02 AM, Panayotis Antonopoulos wrote: Hello, I just noticed that the files that are created using MultipleOutputs remain in the temporary folder, inside attempt sub-folders, when there is no normal output (using context.write(...)). Has anyone else noticed that? Is there any way to change that and make the files appear in the output directory? Thank you in advance! Panagiotis. mapred.local.dir: This lets the MapReduce servers know where to store intermediate files. This may be a comma-separated list of directories to spread the load. Make sure there's enough space here for all your intermediate files. We share the same disks for MapReduce and HDFS. mapred.system.dir: This is a folder in the defaultFS where MapReduce stores some control files. In our case that would be a directory in HDFS. If you have dfs.permissions enabled (which it is by default), make sure that this directory exists and is owned by mapred:hadoop. mapred.temp.dir: This is a folder to store temporary files in. It is hardly -- if at all -- used. If I understand the description correctly this is supposed to be in HDFS but I'm not entirely sure by reading the source code. So we set this to a directory that exists on the local filesystem as well as in HDFS. -- Marcos Luis Ortiz Valmaseda Software Engineer (Distributed Systems) http://uncubanitolinuxero.blogspot.com
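For what it's worth, a frequent cause of MultipleOutputs files never being promoted out of the attempt directories is that the MultipleOutputs instance is never closed, so its record writers are not finalized before the task commits. A sketch of the reducer side with the new API (the named output "side" and the types are hypothetical and must be registered in the driver with MultipleOutputs.addNamedOutput); depending on the version, LazyOutputFormat can additionally be used to avoid the empty part files when nothing goes through context.write:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SideFileReducer extends Reducer<Text, Text, Text, Text> {
  private MultipleOutputs<Text, Text> mos;

  @Override
  protected void setup(Context ctx) {
    mos = new MultipleOutputs<Text, Text>(ctx);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    for (Text v : values) {
      mos.write("side", key, v);   // "side" registered with addNamedOutput in the driver
    }
  }

  @Override
  protected void cleanup(Context ctx) throws IOException, InterruptedException {
    mos.close();                   // required: otherwise the side files are never finalized
  }
}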
Re: run hadoop pseudo-distribute examples failed
On 05/19/2011 10:35 PM, 李�S wrote: Hi Marcos, Thanks for your reply. The temporary directory '/tmp/hadoop-xxx' is defined in hadoop core jar's configuration file *core-default.xml*. Do u think this may cause the failure? Bellow is the detail config: property namehadoop.tmp.dir/name value/tmp/hadoop-${user.name}/value descriptionA base for other temporary directories./description /property And what's the other config files do u need? Almostly, I didn't modify any configuration after downloading the hadoop-0.20.2 files, I think those configuration are all the default values. Yes, those are the default values, but I think that you can test with another directory because this is a temporary directory , and it can be erased easy. For example, when you use the CDH3, the default value there is /var/lib/hadoop-0.20.2/cache/${user.name}, which is more convenient. Of course, it's a recommendation. You can search the Lars Francke's Blog (http://blog.lars-francke.de/) where he did a excellent work explaining the manual installation of a Hadoop Cluster. Regards 2011-05-20 李�S *发件人:* Marcos Ortiz *发送时间:* 2011-05-19 20:40:06 *收件人:* mapreduce-user *抄送:* 李�S *主题:* Re: run hadoop pseudo-distribute examples failed On 05/18/2011 10:53 PM, 李�S wrote: Hi All, I'm trying to run hadoop(0.20.2) examples in Pseudo-Distributed Mode following the hadoop user guide. After I run the 'start-all.sh', it seems the namenode can't connect to datanode. 'SSH localhost' is OK on my server. Someone advises to rm '/tmp/hadoop-' and format namenode again, but it doesn't work. And 'iptables -L' shows there is no firewall rules in my server: test:/home/liyun2010# iptables -L Chain INPUT (policy ACCEPT) target prot opt source destination Chain FORWARD (policy ACCEPT) target prot opt source destination Chain OUTPUT (policy ACCEPT) target prot opt source destination Is there anyone can give me more advice? Thanks! 
Bellow is my namenode and datanode log files: liyun2010@test:~/hadoop-0.20.2/logs$ mailto:liyun2010@test:%7E/hadoop-0.20.2/logs$ cat hadoop-liyun2010-namenode-test.puppet.com.log 2011-05-19 10:58:25,938 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG: / STARTUP_MSG: Starting NameNode STARTUP_MSG: host = test.puppet.com/127.0.0.1 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.2 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010 / 2011-05-19 10:58:26,197 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000 2011-05-19 10:58:26,212 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: test.puppet.com/127.0.0.1:9000 2011-05-19 10:58:26,220 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null 2011-05-19 10:58:26,224 INFO org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext 2011-05-19 10:58:26,405 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=liyun2010,users 2011-05-19 10:58:26,406 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup 2011-05-19 10:58:26,406 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true 2011-05-19 10:58:26,429 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext 2011-05-19 10:58:26,434 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean 2011-05-19 10:58:26,511 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 9 2011-05-19 10:58:26,524 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 1 2011-05-19 10:58:26,530 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 920 loaded in 0 seconds. 2011-05-19 10:58:26,606 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Invalid opcode, reached end of edit log Number of transactions found 99 2011-05-19 10:58:26,606 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /tmp/hadoop-liyun2010/dfs/name/current/edits of size 1049092 edits # 99 loaded in 0 seconds. 2011-05-19 10:58:26,660
Re: Starting Datanode
On 05/20/2011 03:46 PM, Anh Nguyen wrote: On 05/20/2011 01:15 PM, Marcos Ortiz wrote: On 05/20/2011 01:02 PM, Anh Nguyen wrote: Hi, I just upgraded to hadoop-0.20.203.0, and am having a problem starting the datanode: # hadoop datanode Unrecognized option: -jvm Could not create the Java virtual machine. It looks like it has something to do with daemon.sh, particularly the setting of HADOOP_OPTS: if [[ $EUID -eq 0 ]]; then HADOOP_OPTS=$HADOOP_OPTS -jvm server $HADOOP_DATANODE_OPTS else HADOOP_OPTS=$HADOOP_OPTS -server $HADOOP_DATANODE_OPTS fi Am I missing something? Thanks in advance. Anh- Which Java version are you using? # java -version java version 1.6.0_20 Java(TM) SE Runtime Environment (build 1.6.0_20-b02) Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode) It worked with hadoop-0.20.2. Anh- How are you starting the services? Using bin/start-all.sh or simply the datanode? -- Marcos Luís Ortíz Valmaseda Software Engineer (Large-Scaled Distributed Systems) University of Information Sciences, La Habana, Cuba Linux User # 418229 http://about.me/marcosortiz
Re: Starting Datanode
On 05/20/2011 04:08 PM, Anh Nguyen wrote: On 05/20/2011 02:06 PM, Marcos Ortiz wrote: On 05/20/2011 04:27 PM, Marcos Ortiz wrote: On 05/20/2011 03:46 PM, Anh Nguyen wrote: On 05/20/2011 01:15 PM, Marcos Ortiz wrote: On 05/20/2011 01:02 PM, Anh Nguyen wrote: Hi, I just upgraded to hadoop-0.20.203.0, and am having a problem starting the datanode: # hadoop datanode Unrecognized option: -jvm Could not create the Java virtual machine. It looks like it has something to do with daemon.sh, particularly the setting of HADOOP_OPTS: if [[ $EUID -eq 0 ]]; then HADOOP_OPTS=$HADOOP_OPTS -jvm server $HADOOP_DATANODE_OPTS else HADOOP_OPTS=$HADOOP_OPTS -server $HADOOP_DATANODE_OPTS fi Am I missing something? Thanks in advance. Anh- Which Java version are you using? # java -version java version 1.6.0_20 Java(TM) SE Runtime Environment (build 1.6.0_20-b02) Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode) It worked with hadoop-0.20.2. Anh- How are you starting the services? Using bin/start-all.sh or simply the datanode? Anh, first test whether that option (-jvm) is included in that Java version. Tested earlier: # java -jvm Unrecognized option: -jvm Could not create the Java virtual machine. Did you check the requirements for that release? I don't know if this version requires a version greater than 1.6.20. Did you test with 1.6.24? I think it could be a bug. Take some time to review the latest issues for Hadoop in the project's JIRA. Regards -- Marcos Luís Ortíz Valmaseda Software Engineer (Large-Scaled Distributed Systems) University of Information Sciences, La Habana, Cuba Linux User # 418229 http://about.me/marcosortiz
Re: Profiling Hadoop Code
On 05/19/2011 04:26 AM, Shuja Rehman wrote: Hi All, I was investigating ways to profile Hadoop code. All I found is JobConf.setProfileEnabled(boolean) http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setProfileEnabled%28boolean%29 but I believe this is not available in the new API. Can anybody let me know how I can profile my Hadoop code to see which part is taking how much time, so I can tune the application? Thanks -- Regards Shuja-ur-Rehman Baig

Version 0.20.2, Location: /hadoop-0.20.2/src/mapred/org/apache/hadoop/mapred/JobConf.java

/**
 * Get whether the task profiling is enabled.
 * @return true if some tasks will be profiled
 */
public boolean getProfileEnabled() {
  return getBoolean("mapred.task.profile", false);
}

/**
 * Set whether the system should collect profiler information for some of
 * the tasks in this job. The information is stored in the user log directory.
 * @param newValue true means it should be gathered
 */
public void setProfileEnabled(boolean newValue) {
  setBoolean("mapred.task.profile", newValue);
}

/**
 * Get the profiler configuration arguments.
 * The default value for this property is
 * "-agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y,verbose=n,file=%s"
 * @return the parameters to pass to the task child to configure profiling
 */
public String getProfileParams() {
  return get("mapred.task.profile.params",
             "-agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y," +
             "verbose=n,file=%s");
}

/**
 * Set the profiler configuration arguments. If the string contains a '%s' it
 * will be replaced with the name of the profiling output file when the task runs.
 * This value is passed to the task child JVM on the command line.
 * @param value the configuration string
 */
public void setProfileParams(String value) {
  set("mapred.task.profile.params", value);
}

/**
 * Get the range of maps or reduces to profile.
 * @param isMap is the task a map?
 * @return the task ranges
 */
public IntegerRanges getProfileTaskRange(boolean isMap) {
  return getRange((isMap ? "mapred.task.profile.maps" : "mapred.task.profile.reduces"), "0-2");
}

/**
 * Set the ranges of maps or reduces to profile. setProfileEnabled(true)
 * must also be called.
 * @param newValue a set of integer ranges of the map ids
 */
public void setProfileTaskRange(boolean isMap, String newValue) {
  // parse the value to make sure it is legal
  new Configuration.IntegerRanges(newValue);
  set((isMap ? "mapred.task.profile.maps" : "mapred.task.profile.reduces"), newValue);
}

-- Marcos Luís Ortíz Valmaseda Software Engineer (Large-Scaled Distributed Systems) University of Information Sciences, La Habana, Cuba Linux User # 418229 http://about.me/marcosortiz
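A minimal sketch (not part of the original reply) of how the same knobs can be used from a driver: with the new org.apache.hadoop.mapreduce.Job API, the mapred.task.profile* properties quoted above can be set directly on the job's Configuration, since the JobConf setters shown above only wrap them. The property names come from the JobConf source quoted above; the class name and job wiring are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ProfiledJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "profiled-job"); // 0.20-era constructor

    // Equivalent to JobConf.setProfileEnabled(true)
    job.getConfiguration().setBoolean("mapred.task.profile", true);
    // Profile only map attempts 0-2 and reduce attempt 0 (same ranges as getProfileTaskRange)
    job.getConfiguration().set("mapred.task.profile.maps", "0-2");
    job.getConfiguration().set("mapred.task.profile.reduces", "0");
    // hprof arguments passed to the task child JVM; %s is replaced by the output file name
    job.getConfiguration().set("mapred.task.profile.params",
        "-agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y,verbose=n,file=%s");

    // ... set mapper, reducer, input and output as usual ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The profiler output then lands in each profiled task attempt's user log directory, as the setProfileEnabled() javadoc above notes.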
Re: FW: NNbench and MRBench
El 5/8/2011 12:46 AM, stanley@emc.com escribió: Thanks Marcos. This post of Michael Noll does provide some information about how to run these benchmarks, but there's not much information about how to evaluate the results. Do you know some resources about the result analysis? Thanks very much :) Regards, Stanley -Original Message- From: Marcos Ortiz [mailto:mlor...@uci.cu] Sent: 2011年5月8日 11:09 To: mapreduce-user@hadoop.apache.org Cc: Shi, Stanley Subject: Re: FW: NNbench and MRBench El 5/7/2011 10:33 PM, stanley@emc.com escribió: Thanks, Marcos, Through these links, I still can't find anything about the NNbench and MRBench. -Original Message- From: Marcos Ortiz [mailto:mlor...@uci.cu] Sent: 2011年5月8日 10:23 To: mapreduce-user@hadoop.apache.org Cc: Shi, Stanley Subject: Re: FW: NNbench and MRBench El 5/7/2011 8:53 PM, stanley@emc.com escribió: Hi guys, I have a cluster of 16 machines running Hadoop. Now I want to do some benchmark on this cluster with the nnbench and mrbench. I'm new to the hadoop thing and have no one to refer to. I don't know what the supposed result should I have? Now for mrbench, I have an average time of 22sec for a one map job. Is this too bad? What the supposed results might be? For nnbench, what's the supposed results? Below is my result. Datetime: 2011-05-05 20:40:25,459 Test Operation: rename Start time: 2011-05-05 20:40:03,820 Maps to run: 1 Reduces to run: 1 Block Size (bytes): 1 Bytes to write: 0 Bytes per checksum: 1 Number of files: 1 Replication factor: 1 Successful file operations: 1 # maps that missed the barrier: 0 # exceptions: 0 TPS: Rename: 1763 Avg Exec time (ms): Rename: 0.5672 Avg Lat (ms): Rename: 0.4844 null RAW DATA: AL Total #1: 4844 RAW DATA: AL Total #2: 0 RAW DATA: TPS Total (ms): 5672 RAW DATA: Longest Map Time (ms): 5672.0 RAW DATA: Late maps: 0 RAW DATA: # of exceptions: 0 = One more question, when I set maps number to bigger, I get all zeros results: = Test Operation: create_write Start time: 2011-05-03 23:22:39,239 Maps to run: 160 Reduces to run: 160 Block Size (bytes): 1 Bytes to write: 0 Bytes per checksum: 1 Number of files: 1 Replication factor: 1 Successful file operations: 0 # maps that missed the barrier: 0 # exceptions: 0 TPS: Create/Write/Close: 0 Avg exec time (ms): Create/Write/Close: 0.0 Avg Lat (ms): Create/Write: NaN Avg Lat (ms): Close: NaN RAW DATA: AL Total #1: 0 RAW DATA: AL Total #2: 0 RAW DATA: TPS Total (ms): 0 RAW DATA: Longest Map Time (ms): 0.0 RAW DATA: Late maps: 0 RAW DATA: # of exceptions: 0 = Can anyone point me to some documents? I really appreciate your help :) Thanks, stanley You can use these resources: http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/ http://answers.oreilly.com/topic/460-how-to-benchmark-a-hadoop-cluster/ http://wiki.apache.org/hadoop/HardwareBenchmarks http://www.quora.com/Apache-Hadoop/Are-there-any-good-Hadoop-benchmark-problems Regards Well, on the Micheal Noll's post says this: NameNode benchmark (nnbench) === NNBench (see src/test/org/apache/hadoop/hdfs/NNBench.java) is useful for load testing the NameNode hardware and configuration. It generates a lot of HDFS-related requests with normally very small payloads for the sole purpose of putting a high HDFS management stress on the NameNode. The benchmark can simulate requests for creating, reading, renaming and deleting files on HDFS. I like to run this test simultaneously from several machines -- e.g. 
from a set of DataNode boxes -- in order to hit the NameNode from multiple locations at the same time. The syntax of NNBench is as follows: NameNode Benchmark 0.4 Usage: nnbench <options> Options: -operation <op> Available operations are create_write, open_read, rename and delete. This option is mandatory. * NOTE: The open_read, rename and delete
Re: FW: NNbench and MRBench
El 5/7/2011 8:53 PM, stanley@emc.com escribió: Hi guys, I have a cluster of 16 machines running Hadoop. Now I want to do some benchmark on this cluster with the nnbench and mrbench. I'm new to the hadoop thing and have no one to refer to. I don't know what the supposed result should I have? Now for mrbench, I have an average time of 22sec for a one map job. Is this too bad? What the supposed results might be? For nnbench, what's the supposed results? Below is my result. Date time: 2011-05-05 20:40:25,459 Test Operation: rename Start time: 2011-05-05 20:40:03,820 Maps to run: 1 Reduces to run: 1 Block Size (bytes): 1 Bytes to write: 0 Bytes per checksum: 1 Number of files: 1 Replication factor: 1 Successful file operations: 1 # maps that missed the barrier: 0 # exceptions: 0 TPS: Rename: 1763 Avg Exec time (ms): Rename: 0.5672 Avg Lat (ms): Rename: 0.4844 null RAW DATA: AL Total #1: 4844 RAW DATA: AL Total #2: 0 RAW DATA: TPS Total (ms): 5672 RAW DATA: Longest Map Time (ms): 5672.0 RAW DATA: Late maps: 0 RAW DATA: # of exceptions: 0 = One more question, when I set maps number to bigger, I get all zeros results: = Test Operation: create_write Start time: 2011-05-03 23:22:39,239 Maps to run: 160 Reduces to run: 160 Block Size (bytes): 1 Bytes to write: 0 Bytes per checksum: 1 Number of files: 1 Replication factor: 1 Successful file operations: 0 # maps that missed the barrier: 0 # exceptions: 0 TPS: Create/Write/Close: 0 Avg exec time (ms): Create/Write/Close: 0.0 Avg Lat (ms): Create/Write: NaN Avg Lat (ms): Close: NaN RAW DATA: AL Total #1: 0 RAW DATA: AL Total #2: 0 RAW DATA: TPS Total (ms): 0 RAW DATA: Longest Map Time (ms): 0.0 RAW DATA: Late maps: 0 RAW DATA: # of exceptions: 0 = Can anyone point me to some documents? I really appreciate your help :) Thanks, stanley You can use these resources: http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/ http://answers.oreilly.com/topic/460-how-to-benchmark-a-hadoop-cluster/ http://wiki.apache.org/hadoop/HardwareBenchmarks http://www.quora.com/Apache-Hadoop/Are-there-any-good-Hadoop-benchmark-problems Regards -- Marcos Luís Ortíz Valmaseda Software Engineer (Large-Scaled Distributed Systems) University of Information Sciences, La Habana, Cuba Linux User # 418229 http://about.me/marcosortiz
Re: FW: NNbench and MRBench
El 5/7/2011 10:33 PM, stanley@emc.com escribió: Thanks, Marcos, Through these links, I still can't find anything about the NNbench and MRBench. -Original Message- From: Marcos Ortiz [mailto:mlor...@uci.cu] Sent: 2011年5月8日 10:23 To: mapreduce-user@hadoop.apache.org Cc: Shi, Stanley Subject: Re: FW: NNbench and MRBench El 5/7/2011 8:53 PM, stanley@emc.com escribió: Hi guys, I have a cluster of 16 machines running Hadoop. Now I want to do some benchmark on this cluster with the nnbench and mrbench. I'm new to the hadoop thing and have no one to refer to. I don't know what the supposed result should I have? Now for mrbench, I have an average time of 22sec for a one map job. Is this too bad? What the supposed results might be? For nnbench, what's the supposed results? Below is my result. Date time: 2011-05-05 20:40:25,459 Test Operation: rename Start time: 2011-05-05 20:40:03,820 Maps to run: 1 Reduces to run: 1 Block Size (bytes): 1 Bytes to write: 0 Bytes per checksum: 1 Number of files: 1 Replication factor: 1 Successful file operations: 1 # maps that missed the barrier: 0 # exceptions: 0 TPS: Rename: 1763 Avg Exec time (ms): Rename: 0.5672 Avg Lat (ms): Rename: 0.4844 null RAW DATA: AL Total #1: 4844 RAW DATA: AL Total #2: 0 RAW DATA: TPS Total (ms): 5672 RAW DATA: Longest Map Time (ms): 5672.0 RAW DATA: Late maps: 0 RAW DATA: # of exceptions: 0 = One more question, when I set maps number to bigger, I get all zeros results: = Test Operation: create_write Start time: 2011-05-03 23:22:39,239 Maps to run: 160 Reduces to run: 160 Block Size (bytes): 1 Bytes to write: 0 Bytes per checksum: 1 Number of files: 1 Replication factor: 1 Successful file operations: 0 # maps that missed the barrier: 0 # exceptions: 0 TPS: Create/Write/Close: 0 Avg exec time (ms): Create/Write/Close: 0.0 Avg Lat (ms): Create/Write: NaN Avg Lat (ms): Close: NaN RAW DATA: AL Total #1: 0 RAW DATA: AL Total #2: 0 RAW DATA: TPS Total (ms): 0 RAW DATA: Longest Map Time (ms): 0.0 RAW DATA: Late maps: 0 RAW DATA: # of exceptions: 0 = Can anyone point me to some documents? I really appreciate your help :) Thanks, stanley You can use these resources: http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/ http://answers.oreilly.com/topic/460-how-to-benchmark-a-hadoop-cluster/ http://wiki.apache.org/hadoop/HardwareBenchmarks http://www.quora.com/Apache-Hadoop/Are-there-any-good-Hadoop-benchmark-problems Regards Well, on the Micheal Noll's post says this: NameNode benchmark (nnbench) === NNBench (see src/test/org/apache/hadoop/hdfs/NNBench.java) is useful for load testing the NameNode hardware and configuration. It generates a lot of HDFS-related requests with normally very small payloads for the sole purpose of putting a high HDFS management stress on the NameNode. The benchmark can simulate requests for creating, reading, renaming and deleting files on HDFS. I like to run this test simultaneously from several machines -- e.g. from a set of DataNode boxes -- in order to hit the NameNode from multiple locations at the same time. The syntax of NNBench is as follows: NameNode Benchmark 0.4 Usage: nnbench options Options: -operation Available operations are create_write open_read rename delete. This option is mandatory * NOTE: The open_read, rename and delete operations assume that the files they operate on, are already available. The create_write operation must be run before running the other operations. -maps number of maps. default is 1. 
This is not mandatory -reduces number of reduces. default is 1. This is not mandatory -startTime time to start, given in seconds from the epoch. Make sure this is far enough into the future, so all maps (operations) will start at the same time. default is launch time + 2 mins. This is not mandatory -blockSize Block size in bytes
Re: Other FS Pointer?
For example: * Amazon S3 (Amazon Simple Storage Service): http://aws.amazon.com/s3/ On the Hadoop wiki there is a complete guide to working with Hadoop and Amazon S3: http://wiki.apache.org/hadoop/AmazonS3 * IBM GPFS: http://www.ibm.com/systems/gpfs/ https://issues.apache.org/jira/browse/HADOOP-6330 http://www.almaden.ibm.com/StorageSystems/projects/gpfs/ * CloudStore: http://kosmosfs.sourceforge.net/ * Aster Data's integration with Hadoop: http://www.asterdata.com/news/091001-Aster-Hadoop-connector.php Regards - Original Message - From: Anh Nguyen angu...@redhat.com To: hdfs-user@hadoop.apache.org Sent: Wednesday, May 4, 2011 12:57:24 (GMT-0500) Auto-Detected Subject: Other FS Pointer? Hi, Can anyone point me to a doc describing how to port/use another clustered FS? Thanks. Anh- -- Marcos Luís Ortíz Valmaseda Software Engineer Universidad de las Ciencias Informáticas Linux User # 418229 http://uncubanitolinuxero.blogspot.com http://www.linkedin.com/in/marcosluis2186
Re: hadoop branch-0.20-append Build error:build.xml:933: exec returned: 1
On 4/11/2011 10:45 PM, Alex Luya wrote: BUILD FAILED .../branch-0.20-append/build.xml:927: The following error occurred while executing this line: ../branch-0.20-append/build.xml:933: exec returned: 1 Total time: 1 minute 17 seconds + RESULT=1 + '[' 1 '!=' 0 ']' + echo 'Build Failed: 64-bit build not run' Build Failed: 64-bit build not run + exit 1 - I checked the content of build.xml: line 927: <antcall target="cn-docs"/></target> <target name="cn-docs" depends="forrest.check, init" description="Generate forrest-based Chinese documentation. To use, specify -Dforrest.home=&lt;base of Apache Forrest installation&gt; on the command line." if="forrest.home"> line 933: <exec dir="${src.docs.cn}" executable="${forrest.home}/bin/forrest" failonerror="true"> --- It seems to try to execute Forrest; what is the problem here? I am running 64-bit Ubuntu, with 64-bit and 32-bit JDK 1.6 and a 64-bit JDK 1.5 installed. Some people said there are some tricks on this page: http://wiki.apache.org/hadoop/HowToRelease to get the Forrest build to work, but I can't find any tricks on that page. Any help is appreciated. 1- Which version of Java do you have in the JAVA_HOME variable? You can browse the Forrest page to learn how to build it: http://forrest.apache.org 2- Another question for you: do you actually need Forrest? Regards -- Marcos Luís Ortíz Valmaseda Software Engineer (Large-Scaled Distributed Systems) University of Information Sciences, La Habana, Cuba Linux User # 418229
Re: Question regarding datanode been wiped by hadoop
On 4/12/2011 10:46 AM, felix gao wrote: What reason/condition would cause a datanode's blocks to be removed? Our cluster had one of its datanodes crash because of bad RAM. After the system was upgraded and the datanode/tasktracker brought online the next day, we noticed the amount of space utilized was minimal and the cluster was rebalancing blocks to the datanode. It would seem the prior blocks were removed. Was this because the datanode was declared dead? What are the criteria for a namenode to decide (assuming it is the namenode) when a datanode should remove prior blocks? 1- Did you check the DataNode's logs? 2- Did you protect the NameNode's dfs.name.dir and dfs.name.edits.dir directories? In these directories the NameNode stores the file system image and, in the second, the edit log or journal. A good practice is to keep these directories on RAID 1 or RAID 10 to guarantee the consistency of your cluster. Any data loss in these directories will result in a loss of data in your HDFS. A second good practice is to set up a Secondary NameNode in case the primary NameNode fails. Another thing to keep in mind is that when the NameNode fails, you have to restart the JobTracker and the TaskTrackers after the NameNode has been restarted. Regards -- Marcos Luís Ortíz Valmaseda Software Engineer (Large-Scaled Distributed Systems) University of Information Sciences, La Habana, Cuba Linux User # 418229
Re: mapred.min.split.size
On 3/18/2011 3:54 PM, Pedro Costa wrote: Hi, What's the purpose of the parameter mapred.min.split.size? Thanks. There are several parameters that control the number of map tasks for a job, and mapred.min.split.size sets the minimum size of an input split. Other parameters are: - mapreduce.map.tasks: the suggested number of map tasks - dfs.block.size: the file system block size, in bytes, of the input file Regards -- Marcos Luís Ortíz Valmaseda Software Engineer Universidad de las Ciencias Informáticas Linux User # 418229 http://uncubanitolinuxero.blogspot.com http://www.linkedin.com/in/marcosluis2186
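As a rough illustration (not part of the original reply), the split size an input format computes is bounded below by mapred.min.split.size and otherwise driven by the block size and the requested number of maps. The sketch below assumes the 0.20-era JobConf/FileInputFormat API; the class name and input path are hypothetical.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class SplitSizeExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SplitSizeExample.class);
    FileInputFormat.setInputPaths(conf, new Path("/user/example/input")); // hypothetical path

    // Never create splits smaller than 128 MB, even when the HDFS block size is smaller.
    conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);
    // Only a hint; the real number of maps follows from the computed splits.
    conf.setNumMapTasks(10);

    // FileInputFormat roughly computes, per input file:
    //   goalSize  = totalSize / requestedMaps
    //   splitSize = max(minSplitSize, min(goalSize, blockSize))
    // so mapred.min.split.size acts as a floor on split (and therefore map) granularity.
  }
}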
Re: Lost Task Tracker because of no heartbeat
On Wed, 2011-03-16 at 17:50 +0100, baran cakici wrote: Hi Everyone, I make a Project with Hadoop-MapRedeuce for my master-Thesis. I have a strange problem on my System. First of all, I use Hadoop-0.20.2 on Windows XP Pro with Eclipse Plug-In. When I start a job with big Input(4GB - it`s may be not to big, but algorithm require some time), then i lose my Task Tracker in several minutes or seconds. I mean, Seconds since heartbeat increase and then after 600 Seconds I lose TaskTracker. I read somewhere, that can be occured because of small number of open files (ulimit -n). I try to increase this value, but i can write as max value in Cygwin 3200.(ulimit -n 3200) and default value is 256. Actually I don`t know, is it helps or not. In my job and task tracker.log have I some Errors, I posted those to. Jobtracker.log -Call to localhost/127.0.0.1:9000 failed on local exception: java.io.IOException: An existing connection was forcibly closed by the remote host another : - 2011-03-15 12:13:30,718 INFO org.apache.hadoop.mapred.JobTracker: attempt_201103151143_0002_m_91_0 is 97125 ms debug. 2011-03-15 12:16:50,718 INFO org.apache.hadoop.mapred.JobTracker: attempt_201103151143_0002_m_91_0 is 297125 ms debug. 2011-03-15 12:20:10,718 INFO org.apache.hadoop.mapred.JobTracker: attempt_201103151143_0002_m_91_0 is 497125 ms debug. 2011-03-15 12:23:30,718 INFO org.apache.hadoop.mapred.JobTracker: attempt_201103151143_0002_m_91_0 is 697125 ms debug. Error launching task Lost tracker 'tracker_apple:localhost/127.0.0.1:2654' there are my logs(jobtracker.log, tasktracker.log ...) in attachment I need really Help, I don`t have so much time for my Thessis. Thanks a lot for your Helps, Baran Regards, Baran I was analyzing your logs and I have several questions: 1- On the hadoop-Baran-jobtracker-apple.log you have this: Cleaning up the system directory 2011-03-15 01:18:44,468 INFO org.apache.hadoop.mapred.JobTracker: problem cleaning system directory: hdfs://localhost:9000/cygwin/usr/local/hadoop-datastore/hadoop-Baran/mapred/system org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /cygwin/usr/local/hadoop-datastore/hadoop-Baran/mapred/system. Name node is in safe mode. This is a notice that you are doing something wrong with HDFS. Can you provide the output of: hadoop dfsadmin -report on the NameNode? Regards -- Marcos Luís Ortíz Valmaseda Software Engineer Centro de Tecnologías de Gestión de Datos (DATEC) Universidad de las Ciencias Informáticas http://uncubanitolinuxero.blogspot.com http://www.linkedin.com/in/marcosluis2186
Re: Lost Task Tracker because of no heartbeat
On Thu, 2011-03-17 at 00:19 +0530, Harsh J wrote: On Thu, Mar 17, 2011 at 12:42 AM, Marcos Ortiz mlor...@uci.cu wrote: 2011-03-15 01:18:44,468 INFO org.apache.hadoop.mapred.JobTracker: problem cleaning system directory: hdfs://localhost:9000/cygwin/usr/local/hadoop-datastore/hadoop-Baran/mapred/system org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /cygwin/usr/local/hadoop-datastore/hadoop-Baran/mapred/system. Name node is in safe mode. Marcos, the JT keeps attempting to clear the mapred.system.dir on the DFS at startup, and fails because the NameNode wasn't ready when it tried (and thereby reattempts after a time, and passes later when NN is ready for some editing action). This is mostly because Baran is issuing a start-all/stop-all instead of a simple start/stop of mapred components. Thanks a lot, Harsh for the response. I think that's a good entry to add to the Problems/Solutions section on the Hadoop Wiki. Regards -- Marcos Luís Ortíz Valmaseda Software Engineer Centro de Tecnologías de Gestión de Datos (DATEC) Universidad de las Ciencias Informáticas http://uncubanitolinuxero.blogspot.com http://www.linkedin.com/in/marcosluis2186
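For completeness (this is not part of the original exchange), safe mode can also be checked programmatically before kicking off MapReduce work. The sketch below assumes the 0.20-era DistributedFileSystem/FSConstants API and is roughly the equivalent of hadoop dfsadmin -safemode get.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.FSConstants;

public class SafeModeCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    if (fs instanceof DistributedFileSystem) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      // SAFEMODE_GET only queries the state; it does not change it
      boolean inSafeMode = dfs.setSafeMode(FSConstants.SafeModeAction.SAFEMODE_GET);
      System.out.println("NameNode in safe mode: " + inSafeMode);
    }
  }
}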
Re: cloudera CDH3 error: namenode running,but:Error: JAVA_HOME is not set and Java could not be found
On Wed, 2011-03-16 at 23:19 +0800, Alex Luya wrote: I downloaded the Cloudera CDH3 beta (hadoop-0.20.2+228) and modified three files: hdfs.xml, core-site.xml and hadoop-env.sh. I do have JAVA_HOME set in hadoop-env.sh, and when I try to run start-dfs.sh I get this error; the strange thing is that the namenode is running. I can't understand why. Any help is appreciated. I think this question is for the cdh-users mailing list, but I will try to help. 1- Are you sure that you installed the Java Development Kit (JDK 1.6+)? 2- What is your environment? - Operating system - Architecture 3- Can you check the Cloudera documentation about installing and configuring CDH3? http://docs.cloudera.com/ 4- If you created a new user account (recommended) called hadoop, did you check that JAVA_HOME is set in that user's environment? If you are using a Linux distribution supported by the Cloudera team, I recommend that you use the official repositories: http://archives.cloudera.com You can check the latest news on the blog, where they talked about the Linux distributions supported by CDH3: - Debian 5/6 - Ubuntu 10.10 - Red Hat 5.4 - CentOS 5 - SUSE EL 11 The last recommendation is to check the DZone RefCard written by Eugene Ciurana (http://eugeneciurana.eu) and the Cloudera team, called Apache Hadoop Deployment: A Blueprint for Reliable Distributed Computing. Regards, -- Marcos Luís Ortíz Valmaseda Software Engineer Centro de Tecnologías de Gestión de Datos (DATEC) Universidad de las Ciencias Informáticas http://uncubanitolinuxero.blogspot.com http://www.linkedin.com/in/marcosluis2186
Re: Could not obtain block
El 3/9/2011 6:27 AM, Evert Lammerts escribió: We see a lot of IOExceptions coming from HDFS during a job that does nothing but untar 100 files (1 per Mapper, sizes vary between 5GB and 80GB) that are in HDFS, to HDFS. DataNodes are also showing Exceptions that I think are related. (See stacktraces below.) This job should not be able to overload the system I think... I realize that much data needs to go over the lines, but HDFS should still be responsive. Any ideas / help is much appreciated! Some details: * Hadoop 0.20.2 (CDH3b4) * 5 node cluster plus 1 node for JT/NN (Sun Thumpers) * 4 cores/node, 4GB RAM/core * CentOS 5.5 Job output: java.io.IOException: java.io.IOException: Could not obtain block: blk_-3695352030358969086_130839 file=/user/emeij/icwsm-data-test/01-26-SOCIAL_MEDIA.tar.gz Which is the ouput of: bin/hadoop dfsadmin -report Which is the output of: bin/hadoop fsck /user/emeij/icwsm-data-test/ at ilps.DownloadICWSM$UntarMapper.map(DownloadICWSM.java:449) at ilps.DownloadICWSM$UntarMapper.map(DownloadICWSM.java:1) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:390) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324) at org.apache.hadoop.mapred.Child$4.run(Child.java:240) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115) at org.apache.hadoop.mapred.Child.main(Child.java:234) Caused by: java.io.IOException: Could not obtain block: blk_-3695352030358969086_130839 file=/user/emeij/icwsm-data-test/01-26-SOCIAL_MEDIA.tar.gz Which is the ouput of: bin/hadoop fsck /user/emeij/icwsm-data-test/01-26-SOCIAL_MEDIA.tar.gz --files -blocks -racks at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1977) at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1784) at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1932) at java.io.DataInputStream.read(DataInputStream.java:83) at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:55) at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:74) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:335) at ilps.DownloadICWSM$CopyThread.run(DownloadICWSM.java:149) Example DataNode Exceptions (not that these come from the node at 192.168.28.211): 2011-03-08 19:40:40,297 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-9222067946733189014_3798233 java.io.EOFException: while trying to read 3067064 bytes 2011-03-08 19:40:41,018 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.211:50050, dest: /192.168.28.211:49748, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201103071120_0030_m_32_0, offset: 30 72, srvID: DS-568746059-145.100.2.180-50050-1291128670510, blockid: blk_3596618013242149887_4060598, duration: 2632000 2011-03-08 19:40:41,049 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-9221028436071074510_2325937 java.io.EOFException: while trying to read 2206400 bytes 2011-03-08 19:40:41,348 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-9221549395563181322_4024529 java.io.EOFException: while trying to read 3037288 bytes 2011-03-08 19:40:41,357 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-9221885906633018147_3895876 
java.io.EOFException: while trying to read 1981952 bytes 2011-03-08 19:40:41,434 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Block blk_-9221885906633018147_3895876 unfinalized and removed. 2011-03-08 19:40:41,434 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-9221885906633018147_3895876 received exception java.io.EOFException: while trying to read 1981952 bytes 2011-03-08 19:40:41,434 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.28.211:50050, storageID=DS-568746059-145.100.2.180-50050-1291128670510, infoPort=50075, ipcPort=50020):DataXceiver java.io.EOFException: while trying to read 1981952 bytes at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:270) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:357) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417) at
Re: Could not obtain block
On 3/9/2011 11:09 AM, Evert Lammerts wrote: I didn't mention it, but the complete filesystem is reported healthy by fsck. I'm guessing that the java.io.EOFException indicates a problem caused by the load of the job. Any ideas? It's very tricky to debug a MapReduce job execution, but I'll try. java.io.EOFException: while trying to read 1981952 bytes at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:270) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:357) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122) 2011-03-08 19:40:41,465 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Block blk_-9221549395563181322_4024529 unfinalized and removed. 1- Did you check this? 2- What are the file permissions on /user/emeij/icwsm-data-test/ ? If the fsck command reports that everything is fine, I really don't know anything more. Regards -- Marcos Luís Ortíz Valmaseda Software Engineer Universidad de las Ciencias Informáticas Linux User # 418229 http://uncubanitolinuxero.blogspot.com http://www.linkedin.com/in/marcosluis2186
how to use hadoop apis with cloudera distribution ?
On Tue, 2011-03-08 at 07:16 -0800, Mapred Learn wrote: Hi, I downloaded the CDH3 VM for Hadoop, but if I want to use something like import org.apache.hadoop.conf.Configuration; in my Java code, what else do I need to do? You can see all the tutorials that Cloudera has on its site: http://www.cloudera.com/presentations http://www.cloudera.com/info/training http://www.cloudera.com/developers/learn-hadoop/ You can check the CDH3 official documentation and the latest news about the new release: http://docs.cloudera.com http://www.cloudera.com/blog/category/cdh/ Do I need to download Hadoop from Apache? No, CDH3 beta comes with all the tools required to work with Hadoop, plus additional applications like Hue, Oozie, ZooKeeper, Pig, Hive, Chukwa, HBase, Flume, etc. If yes, then what does CDH3 do? The Cloudera team has done excellent work packaging the most used Hadoop applications on a single virtual machine for testing, and they provide a much easier way to use Hadoop. They have Red Hat and Ubuntu/Debian compatible packages that make the installation, configuration and use of Hadoop on these operating systems easier. Please read http://docs.cloudera.com If not, then where can I find the Hadoop code on the CDH VM? I am using the above line in my Java code in Eclipse and Eclipse is not able to find it. Did you set JAVA_HOME and HADOOP_HOME on your system? If you have any doubt about this, you can check the excellent DZone refcards Getting Started with Hadoop and Deploying Hadoop, written by Eugene Ciurana (http://eugeneciurana.eu), VP of Technology at Badoo.com. Regards, and I hope this information is useful for you. -- Marcos Luís Ortíz Valmaseda Software Engineer Centro de Tecnologías de Gestión de Datos (DATEC) Universidad de las Ciencias Informáticas http://uncubanitolinuxero.blogspot.com http://www.linkedin.com/in/marcosluis2186
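To make the import question above concrete, here is a minimal sketch (file paths and class name are illustrative, not from the thread): once the Hadoop jars shipped with CDH (e.g. hadoop-core-*.jar and its lib/ dependencies) are on the Eclipse build path, org.apache.hadoop.conf.Configuration resolves like any other library class.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class QuickClasspathCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical location of the cluster config; adjust to your CDH install
    conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Default filesystem: " + fs.getUri());
  }
}

If this compiles and prints an hdfs:// URI, the classpath and configuration are wired correctly; build-path problems show up as the unresolved import the original poster describes.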
Re: Dataset comparison and ranking - views
On Tue, 2011-03-08 at 10:51 +0530, Sonal Goyal wrote: Hi Marcos, Thanks for replying. I think I was not very clear in my last post. Let me describe my use case in detail. I have two datasets coming from different sources; let's call them dataset1 and dataset2. Both of them contain records for entities, say Person. A single record looks like: First Name Last Name, Street, City, State, Zip. We want to compare each record of dataset1 with each record of dataset2, in effect a cross join. We know that, because of the way the data is collected, names will not match exactly, but we want to find close-enough matches. So we have a rule which says: create bigrams and find the matching bigrams. If 0 to 5 match, give a score of 10; if 5-15 match, give a score of 20; and so on.

Well, an approach to this problem is given by Milind Bhandarkar in his presentation Practical Problem Solving with Hadoop and Pig. He talks about a solution for bigrams, giving an example with word matching.

Bigrams • Input: a large text corpus • Output: List(word1, Top K(word2)) • Two stages: • Generate all possible bigrams • Find the most frequent K bigrams for each word

Bigrams: Stage 1 Map • Generate all possible bigrams • Map input: large text corpus • Map computation: in each sentence, for each "word1 word2", output (word1, word2) and (word2, word1) • Partition & sort by (word1, word2)

pairs.pl
while(<STDIN>) {
  chomp;
  $_ =~ s/[^a-zA-Z]+/ /g;
  $_ =~ s/^\s+//g;
  $_ =~ s/\s+$//g;
  $_ =~ tr/A-Z/a-z/;
  my @words = split(/\s+/, $_);
  for (my $i = 0; $i < $#words - 1; ++$i) {
    print "$words[$i]:$words[$i+1]\n";
    print "$words[$i+1]:$words[$i]\n";
  }
}

Bigrams: Stage 1 Reduce • Input: List(word1, word2) sorted and partitioned • Output: List(word1, [freq, word2]) • Counting similar to the Unigrams example

count.pl
$_ = <STDIN>;
chomp;
my ($pw1, $pw2) = split(/:/, $_);
$count = 1;
while(<STDIN>) {
  chomp;
  my ($w1, $w2) = split(/:/, $_);
  if ($w1 eq $pw1 && $w2 eq $pw2) {
    $count++;
  } else {
    print "$pw1:$count:$pw2\n";
    $pw1 = $w1;
    $pw2 = $w2;
    $count = 1;
  }
}
print "$pw1:$count:$pw2\n";

Bigrams: Stage 2 Map • Input: List(word1, [freq, word2]) • Output: List(word1, [freq, word2]) • Identity mapper (/bin/cat) • Partition by word1 • Sort descending by (word1, freq)

Bigrams: Stage 2 Reduce • Input: List(word1, [freq, word2]), partitioned by word1 and sorted descending by (word1, freq) • Output: Top K (List(word1, [freq, word2])) • For each word, throw away everything after K records

firstN.pl
$N = 5;
$_ = <STDIN>;
chomp;
my ($pw1, $count, $pw2) = split(/:/, $_);
$idx = 1;
$out = "$pw1\t$pw2,$count;";
while(<STDIN>) {
  chomp;
  my ($w1, $c, $w2) = split(/:/, $_);
  if ($w1 eq $pw1) {
    if ($idx < $N) {
      $out .= "$w2,$c;";
      $idx++;
    }
  } else {
    print "$out\n";
    $pw1 = $w1;
    $idx = 1;
    $out = "$pw1\t$w2,$c;";
  }
}
print "$out\n";

You can translate this approach to your specific problem. I recommend discussing this with him, because he has vast experience with all of this, much more than I do. Regards

For Zip, we have a rule saying: exact match, or within 5 km of each other (through a lookup), gives a score of 50, and so on. Once we have compared each person of dataset1 with each person of dataset2, we find the overall rank, which is a weighted average of the scores of the name, address, etc. comparisons. One approach is to use the DistributedCache for the smaller dataset and do a nested-loop join in the mapper (a sketch of this map-side approach follows at the end of this thread). The second approach is to use multiple MR flows, compare the fields, and reduce/collate the results. I am curious to know if people have other approaches they have implemented, what efficiencies they have built up, etc.
Thanks and Regards, Sonal Hadoop ETL and Data Integration Nube Technologies On Tue, Mar 8, 2011 at 12:55 AM, Marcos Ortiz mlor...@uci.cu wrote: On Tue, 2011-03-08 at 00:36 +0530, Sonal Goyal wrote: Hi, I am working on a problem to compare two different datasets, and rank each record of the first with respect to the other, in terms of how similar they are. The records are dimensional, but do not have a lot of dimensions. Some of the fields will be compared for exact matches, some for similar sound, some with closest match etc. One of the datasets is large, and the other is much smaller. The final goal is to compute a rank between each record of first dataset with each record of the second. The rank is based on weighted scores of each dimension comparison. I was wondering if people in the community have any advice/suggested patterns/thoughts about cross joining two datasets in map reduce. Do let me know if you have any suggestions. Thanks and Regards, Sonal Hadoop ETL and Data Integration Nube
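A minimal sketch (not from the thread; the class name, paths and weights are hypothetical) of the first approach mentioned above: ship the smaller dataset through the DistributedCache and do the nested-loop comparison in the mapper, scoring names by shared character bigrams with the bucket scores described earlier.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.*;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each map() call sees one dataset1 record and compares it against every
// cached dataset2 record: a map-side nested-loop / cross join.
public class CrossJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

  private final List<String[]> smallDataset = new ArrayList<String[]>();

  @Override
  protected void setup(Context context) throws IOException {
    // dataset2 was added to the cache in the driver, e.g.:
    // DistributedCache.addCacheFile(new URI("/user/example/dataset2.csv"), conf);
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
    String line;
    while ((line = in.readLine()) != null) {
      smallDataset.add(line.split(","));  // First Name Last Name, Street, City, State, Zip
    }
    in.close();
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] rec1 = value.toString().split(",");
    for (String[] rec2 : smallDataset) {
      int nameScore = bigramScore(rec1[0], rec2[0]);
      int zipScore = rec1[4].trim().equals(rec2[4].trim()) ? 50 : 0; // "within 5 km" lookup omitted
      double rank = 0.6 * nameScore + 0.4 * zipScore;                // illustrative weights
      context.write(new Text(rec1[0] + "|" + rec2[0]), new Text(String.valueOf(rank)));
    }
  }

  // Score by the number of shared character bigrams, bucketed as in the thread.
  private static int bigramScore(String a, String b) {
    Set<String> bigramsA = bigrams(a);
    Set<String> bigramsB = bigrams(b);
    bigramsA.retainAll(bigramsB);
    int matches = bigramsA.size();
    if (matches <= 5) return 10;
    if (matches <= 15) return 20;
    return 30;
  }

  private static Set<String> bigrams(String s) {
    Set<String> result = new HashSet<String>();
    String t = s.toLowerCase().replaceAll("[^a-z]", "");
    for (int i = 0; i + 1 < t.length(); i++) {
      result.add(t.substring(i, i + 2));
    }
    return result;
  }
}

This only works when dataset2 fits comfortably in each mapper's memory; otherwise the multi-flow, reduce-side route mentioned above is the safer choice.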
Re: how to use hadoop apis with cloudera distribution ?
You can check the Cloudera training videos, where there is a screencast explaining how to develop for Hadoop using Eclipse: http://www.cloudera.com/presentations http://vimeo.com/cloudera Now, for working with the Hadoop APIs from Eclipse and developing applications based on Hadoop, you can use the Karmasphere plugin for Hadoop development, or, if you are a NetBeans user, they have a module for that. Regards. - Original Message - From: Mapred Learn mapred.le...@gmail.com To: Marcos Ortiz mlor...@uci.cu CC: mapreduce-user@hadoop.apache.org Sent: Tuesday, March 8, 2011 12:26:00 (GMT-0500) Auto-Detected Subject: Re: how to use hadoop apis with cloudera distribution ? Thanks Marcos! I was trying to use CDH3 with Eclipse and could not work out why Eclipse complains about the import statement for the Hadoop APIs when Cloudera already includes them. I did not understand how CDH3 works with Eclipse; does it download the Hadoop APIs when we add SVN URLs? On Tue, Mar 8, 2011 at 7:22 AM, Marcos Ortiz mlor...@uci.cu wrote: On Tue, 2011-03-08 at 07:16 -0800, Mapred Learn wrote: Hi, I downloaded the CDH3 VM for Hadoop, but if I want to use something like import org.apache.hadoop.conf.Configuration; in my Java code, what else do I need to do? You can see all the tutorials that Cloudera has on its site: http://www.cloudera.com/presentations http://www.cloudera.com/info/training http://www.cloudera.com/developers/learn-hadoop/ You can check the CDH3 official documentation and the latest news about the new release: http://docs.cloudera.com http://www.cloudera.com/blog/category/cdh/ Do I need to download Hadoop from Apache? No, CDH3 beta comes with all the tools required to work with Hadoop, plus additional applications like Hue, Oozie, ZooKeeper, Pig, Hive, Chukwa, HBase, Flume, etc. If yes, then what does CDH3 do? The Cloudera team has done excellent work packaging the most used Hadoop applications on a single virtual machine for testing, and they provide a much easier way to use Hadoop. They have Red Hat and Ubuntu/Debian compatible packages that make the installation, configuration and use of Hadoop on these operating systems easier. Please read http://docs.cloudera.com If not, then where can I find the Hadoop code on the CDH VM? I am using the above line in my Java code in Eclipse and Eclipse is not able to find it. Did you set JAVA_HOME and HADOOP_HOME on your system? If you have any doubt about this, you can check the excellent DZone refcards Getting Started with Hadoop and Deploying Hadoop, written by Eugene Ciurana (http://eugeneciurana.eu), VP of Technology at Badoo.com. Regards, and I hope this information is useful for you. -- Marcos Luís Ortíz Valmaseda Software Engineer Centro de Tecnologías de Gestión de Datos (DATEC) Universidad de las Ciencias Informáticas http://uncubanitolinuxero.blogspot.com http://www.linkedin.com/in/marcosluis2186 -- Marcos Luís Ortíz Valmaseda Software Engineer Universidad de las Ciencias Informáticas Linux User # 418229 http://uncubanitolinuxero.blogspot.com http://www.linkedin.com/in/marcosluis2186
Re: Dataset comparison and ranking - views
On Tue, 2011-03-08 at 00:36 +0530, Sonal Goyal wrote: Hi, I am working on a problem to compare two different datasets and rank each record of the first with respect to the other, in terms of how similar they are. The records are dimensional, but do not have a lot of dimensions. Some of the fields will be compared for exact matches, some for similar sound, some for the closest match, etc. One of the datasets is large, and the other is much smaller. The final goal is to compute a rank between each record of the first dataset and each record of the second. The rank is based on weighted scores of each dimension comparison. I was wondering if people in the community have any advice/suggested patterns/thoughts about cross joining two datasets in MapReduce. Do let me know if you have any suggestions. Thanks and Regards, Sonal Hadoop ETL and Data Integration Nube Technologies Regards, Sonal. Can you give us more information about a basic workflow of your idea? Some questions: - How do you know that two records are identical? By id? - Can you give an example of the ranking that you want to achieve for each case: - two records that are identical - two records that are similar - two records with the closest match For MapReduce algorithm design, I recommend this excellent post from Ricky Ho: http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html For the join of the two datasets, you can use Pig. Here is a basic Pig example from Milind Bhandarkar (mili...@yahoo-inc.com)'s talk Practical Problem Solving with Hadoop and Pig:
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
-- Marcos Luís Ortíz Valmaseda Software Engineer Centro de Tecnologías de Gestión de Datos (DATEC) Universidad de las Ciencias Informáticas http://uncubanitolinuxero.blogspot.com http://www.linkedin.com/in/marcosluis2186