Re: Sqoop Installation on Apache Hadoop 0.20.2
Thank you both, Aaron and Sonal, for your precious comments and contributions. I'll check out both projects and try to make a design decision. I'm familiar with Sqoop and just heard about hiho. Sonal: I guess hiho is a single map/reduce job handling the MySQL-Hadoop integration. Is it possible to use it with other JDBC connectors too?

Best Regards, Utku

On Fri, Mar 19, 2010 at 5:07 AM, Sonal Goyal sonalgoy...@gmail.com wrote:

Hi Utku, If MySQL is your target database, you may check Meghsoft's hiho: http://code.google.com/p/hiho/ The current release supports transferring data from Hadoop to MySQL. We will be releasing the MySQL-to-Hadoop transfer functionality soon, sometime next week. Thanks and Regards, Sonal www.meghsoft.com

On Thu, Mar 18, 2010 at 5:31 AM, Aaron Kimball aa...@cloudera.com wrote:

Hi Utku, Apache Hadoop 0.20 cannot support Sqoop as-is. Sqoop makes use of DataDrivenDBInputFormat (among other APIs) which is not shipped with Apache's 0.20 release. In order to get Sqoop working on 0.20, you'd need to apply a lengthy list of patches from the project source repository to your copy of Hadoop and recompile. Or you could just download it all from Cloudera, where we've done that work for you :) So as it stands, Sqoop won't run on 0.20 unless you choose to use Cloudera's distribution.

Do note that your use of the term "fork" is a bit strong here; with the exception of (minor) modifications to make it interact more compatibly with the external Linux environment, our distribution only includes code that's available to the project at large. But some of that code has not been rolled into a binary release from Apache yet. If you choose to go with Cloudera's distribution, it just means that you get publicly available features (like Sqoop, MRUnit, etc.) a year or so ahead of what Apache has formally released, but our codebase isn't radically diverging; CDH is somewhere ahead of the Apache 0.20 release, but behind Apache's svn trunk. (All of Sqoop, MRUnit, etc. are available in the Hadoop source repository on the trunk branch.)

If you install our distribution, Sqoop will be installed in /usr/lib/hadoop-0.20/contrib/sqoop and /usr/bin/sqoop for you. There isn't a separate package to install Sqoop independent of the rest of CDH; thus no extra download link on our site. I hope this helps! Good luck, - Aaron

On Wed, Mar 17, 2010 at 4:30 AM, Reik Schatz reik.sch...@bwin.org wrote:

At least for MRUnit, I was not able to find it outside of the Cloudera distribution (CDH). What I did: installed CDH locally using apt (Ubuntu), found and copied the mrunit library into my local Maven repository, and removed CDH afterwards. I guess the same is somehow possible for Sqoop. /Reik

Utku Can Topçu wrote:

Dear All, I'm trying to run tests using MySQL as a kind of data source, so I thought Cloudera's Sqoop would be a nice project to have in production. However, I'm not using Cloudera's Hadoop distribution right now, and I'm not actually thinking of switching from the main project to a fork. I read the documentation on Sqoop at http://www.cloudera.com/developers/downloads/sqoop/ but there are actually no links for downloading Sqoop itself. Does anyone here know of, or has anyone tried to use, Sqoop with the latest Apache Hadoop? If so, can you give me some tips and tricks on it?
Best Regards, Utku
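For anyone landing on this thread later, a rough sketch of the kind of import run being discussed, assuming the /usr/bin/sqoop wrapper from the CDH paths Aaron mentions (the connect string, table, and target directory here are hypothetical):

    /usr/bin/sqoop \
      --connect jdbc:mysql://db.example.com/testdb \
      --username hadoop \
      --table employees \
      --warehouse-dir /user/hadoop/imports

--connect takes an ordinary JDBC connect string, which is essentially what Utku is asking hiho to support as well.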
Re: Why must I wait for NameNode?
There's a bit of an issue if you have no data in your HDFS -- 0 blocks out of 0 is considered 100% reported, so the NN leaves safe mode even if there are no DNs talking to it yet. For a fix, please see HDFS-528, included in Cloudera's CDH2. Thanks -Todd

On Fri, Mar 19, 2010 at 10:29 AM, Bill Habermaas b...@habermaas.us wrote:

At startup, the namenode goes into 'safe' mode to wait for all data nodes to send block reports on the data they are holding. This is normal for Hadoop and necessary to make sure all replicated data is accounted for across the cluster. It is the nature of the beast to work this way, for good reasons. Bill

-----Original Message----- From: Nick Klosterman [mailto:nklos...@ecn.purdue.edu] Sent: Friday, March 19, 2010 1:21 PM To: common-user@hadoop.apache.org Subject: Why must I wait for NameNode?

What is the namenode doing upon startup? I have to wait about 1 minute and watch for the namenode dfs usage to drop from 100%, otherwise the install is unusable. Is this typical? Is something wrong with my install? I've been attempting the pseudo-distributed tutorial example for a while, trying to get it to work. I finally discovered that the namenode upon startup is 100% in use and I need to wait about 1 minute before I can use it. Is this typical of Hadoop installations? This isn't entirely clear in the tutorial; I believe a note should be added if it is. This error caused me to get: WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: SOMEFILE could only be replicated to 0 nodes, instead of 1. I had written a script to do all of the steps right in a row. Now, with a 1 minute wait, things work. Is my install atypical, or am I doing something wrong that is causing this needed wait time? Thanks, Nick

-- Todd Lipcon Software Engineer, Cloudera
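If the underlying goal is just to make a scripted sequence of steps reliable, dfsadmin also has a wait option (see Ravi's reply below) that blocks until the namenode leaves safe mode; a minimal sketch for a pseudo-distributed setup (paths and file names are hypothetical):

    #!/bin/sh
    # Start the cluster, then block until the namenode has left safe
    # mode before trying to write anything into HDFS.
    bin/start-all.sh
    bin/hadoop dfsadmin -safemode wait
    bin/hadoop fs -put localfile /user/nick/input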
Re: Why must I wait for NameNode?
If you don't want to wait, you can do bin/hadoop dfsadmin -safemode leave. And this might be useful for reference:

-safemode enter|leave|get|wait: Safe mode maintenance command. Safe mode is a Namenode state in which it
1. does not accept changes to the name space (read-only)
2. does not replicate or delete blocks.
Safe mode is entered automatically at Namenode startup, and the Namenode leaves safe mode automatically when the configured minimum percentage of blocks satisfies the minimum replication condition. Safe mode can also be entered manually, but then it can only be turned off manually as well.

Ravi
Hadoop @ Yahoo!
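The same check is also available programmatically, which can be handy inside a Java driver; a minimal sketch, assuming an 0.20-era client classpath (the class name is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.FSConstants;

    public class SafeModeCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            if (fs instanceof DistributedFileSystem) {
                DistributedFileSystem dfs = (DistributedFileSystem) fs;
                // SAFEMODE_GET only queries the current state; it changes nothing.
                boolean inSafeMode =
                    dfs.setSafeMode(FSConstants.SafeModeAction.SAFEMODE_GET);
                System.out.println("NameNode in safe mode: " + inSafeMode);
            }
        }
    }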
Re: (Strange!)getFileSystem in JVM shutdown hook throws shutdown in progress exception
I have logged a comment in https://issues.apache.org/jira/browse/HADOOP-4829 which is related to the IllegalStateException that I saw when Cache.remove() tried to remove a shutdown hook while the JVM was in the process of shutting down. Cheers

On Wed, Mar 10, 2010 at 11:00 AM, Todd Lipcon t...@cloudera.com wrote:

Hi, The issue here is that Hadoop itself uses a shutdown hook to close all open filesystems when the JVM shuts down. Since JVM shutdown hooks don't run in a specified order, you shouldn't access Hadoop filesystem objects from a shutdown hook. To get around this you can use the fs.automatic.close configuration variable (provided by this patch: https://issues.apache.org/jira/browse/HADOOP-4829) to disable the Hadoop shutdown hook. This patch is applied in CDH2 (otherwise you'll have to apply it manually). Note that if you disable the shutdown hook, you'll need to manually close the filesystems using FileSystem.closeAll. Thanks -Todd

On Tue, Mar 9, 2010 at 9:39 PM, Silence wil...@yahoo.cn wrote:

Hi fellows, The code segment below adds a shutdown hook to the JVM, but when I ran it I got a strange exception:

java.lang.IllegalStateException: Shutdown in progress
 at java.lang.ApplicationShutdownHooks.add(ApplicationShutdownHooks.java:39)
 at java.lang.Runtime.addShutdownHook(Runtime.java:192)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1387)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:191)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:180)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
 at young.Main$1.run(Main.java:21)

The Javadoc says this exception is thrown when the virtual machine is already in the process of shutting down (http://java.sun.com/j2se/1.5.0/docs/api/). What does this mean? Why does this happen? How do I fix it? I'd really appreciate it if you could try this code and help me figure out what's going on here. Thank you!

---
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

@SuppressWarnings("deprecation")
public class Main {
    public static void main(String[] args) {
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                Path path = new Path("/temp/hadoop-young");
                System.out.println("Thread run: " + path);
                Configuration conf = new JobConf();
                FileSystem fs;
                try {
                    // Fails: getFileSystem() registers a shutdown hook of its
                    // own, which is illegal once shutdown has already begun.
                    fs = path.getFileSystem(conf);
                    if (fs.exists(path)) {
                        fs.delete(path);
                    }
                } catch (Exception e) {
                    System.err.println(e.getMessage());
                    e.printStackTrace();
                }
            }
        });
    }
}
---

-- View this message in context: http://old.nabble.com/%28Strange%21%29getFileSystem-in-JVM-shutdown-hook-throws-shutdown-in-progress-exception-tp27845803p27845803.html Sent from the Hadoop core-user mailing list archive at Nabble.com.

-- Todd Lipcon Software Engineer, Cloudera
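A sketch of Todd's suggestion applied to the example above, assuming a Hadoop build that includes the HADOOP-4829 patch (the class name and path are carried over from the original code, and the filesystem must be obtained with this configuration before shutdown begins):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class Main {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Disable Hadoop's own shutdown hook (HADOOP-4829) so the
            // filesystem is still usable from our hook during JVM shutdown.
            conf.setBoolean("fs.automatic.close", false);
            // Obtain the filesystem up front, not inside the hook.
            final FileSystem fs = FileSystem.get(conf);
            final Path path = new Path("/temp/hadoop-young");
            Runtime.getRuntime().addShutdownHook(new Thread() {
                @Override
                public void run() {
                    try {
                        if (fs.exists(path)) {
                            fs.delete(path, true);
                        }
                        // Automatic close is off, so close filesystems manually.
                        FileSystem.closeAll();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
    }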
Re: performance analysis?
Thanks, Ninad. This really helps. Best regards, Michael

--- On Fri, 3/19/10, Ninad Raut hbase.user.ni...@gmail.com wrote:

From: Ninad Raut hbase.user.ni...@gmail.com Subject: Re: performance analysis? To: common-user@hadoop.apache.org Date: Friday, March 19, 2010, 12:02 AM

The best and easiest tool to configure is Ganglia. Hadoop has built-in support for Ganglia. Check out the YDN Ganglia setup steps and you will be able to monitor your CPU and MapReduce jobs as well. To monitor network-related aspects you can check out Nagios. Regards, Ninad R

On Fri, Mar 19, 2010 at 3:39 AM, jiang licht licht_ji...@yahoo.com wrote:

To find the bottleneck, I tried to figure out whether some processes/threads are often blocked waiting for disk or network I/O, and why, when either a mapper or reducer runs slowly. In my case, on each slave, up to 12 mappers are allowed to run simultaneously. CPUs are more than 90% of the time in idle mode and at most about 2% in iowait. But I found most mappers (from top and jps) were asleep, and strace shows that they (including the tasktracker and datanode) were blocked on futex(0x4035b9d0, FUTEX_WAIT, 12566, NULL,

Here's a list of accumulated open files (including network, pipe, socket, etc.) of the datanode, grouped by type: IPv6 15, unix 1, DIR 2, CHR 4 17, REG 122, sock 1, FIFO 34.

Here's the same for the tasktracker: IPv6 24, unix 1, DIR 2, CHR 4 4, REG 105, sock 1, FIFO 50.

Here's a typical mapper thread: IPv6 2, unix 1 1, DIR 4, sock 1, FIFO 2, CHR 6, REG 106.

A mapper would block on futex for about a minute or so. It seems to me that various I/O cannot catch up with the CPUs. Would it be helpful to increase some buffer parameters to handle this? Or do these stats imply something else? BTW, what is an effective way to analyze the performance of a Hadoop cluster, and what about good tools? Any recommendations? Thanks, Michael
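For what it's worth, stats like the ones above can be gathered with standard JDK and Linux tools; a rough sketch, assuming it runs on a slave node (output file names are arbitrary):

    # Find the tasktracker's pid.
    TT_PID=$(jps | awk '/TaskTracker/ {print $1}')
    # Thread dump: shows which threads are blocked and on what.
    jstack $TT_PID > tasktracker-threads.txt
    # Count open file descriptors by type (REG, FIFO, IPv6, ...).
    lsof -p $TT_PID | awk 'NR > 1 {print $5}' | sort | uniq -c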
Re: java.lang.NullPointerException at org.apache.hadoop.mapred.IFile$Writer.&lt;init&gt;(IFile.java:102)
Thanks, Amogh. Best regards, Michael

--- On Thu, 3/18/10, Amogh Vasekar am...@yahoo-inc.com wrote:

From: Amogh Vasekar am...@yahoo-inc.com Subject: Re: java.lang.NullPointerException at org.apache.hadoop.mapred.IFile$Writer.&lt;init&gt;(IFile.java:102) To: common-user@hadoop.apache.org Date: Thursday, March 18, 2010, 11:34 PM

Hi, http://hadoop.apache.org/common/docs/current/native_libraries.html should answer your questions. Amogh

On 3/18/10 10:48 PM, jiang licht licht_ji...@yahoo.com wrote:

I got the following error when I tried to enable gzip compression on map output, using hadoop-0.20.1.

Settings in mapred-site.xml:
mapred.compress.map.output=true
mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec

Error message:
java.lang.NullPointerException
 at org.apache.hadoop.mapred.IFile$Writer.&lt;init&gt;(IFile.java:102)
 at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1198)
 at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1091)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 at org.apache.hadoop.mapred.Child.main(Child.java:170)

I read the source: Writer in IFile takes care of map output compression. So it seems to me that I didn't have the gzip native library built, or didn't have correct settings. There is no build folder in HADOOP_HOME and no native folder under lib in HADOOP_HOME. I checked that I have gzip and zlib installed, so the next step is to build the Hadoop native library on top of these. How do I do that? Is it a simple matter of pointing some variable to the gzip or zlib libs, or should I use build.xml in Hadoop to build some target? If so, what target should I build? Thanks, Michael
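For reference, the two settings quoted above in mapred-site.xml property form (this is just the XML rendering of the key=value pairs from the thread):

    <property>
      <name>mapred.compress.map.output</name>
      <value>true</value>
    </property>
    <property>
      <name>mapred.map.output.compression.codec</name>
      <value>org.apache.hadoop.io.compress.GzipCodec</value>
    </property>

If I remember the page Amogh links correctly, the native library itself is built from the Hadoop source tree with ant and the compile.native flag set (ant -Dcompile.native=true ...), with the zlib development headers installed; check that page for the exact target and prerequisites.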