Re: Sqoop Installation on Apache Hadoop 0.20.2

2010-03-19 Thread Utku Can Topçu
Thank you both Aaron and Sonal for your precious comments and contributions.

I'll check both the projects and try to make a design decision.

I'm familiar with Sqoop and just heard about hiho.

Sonal: I guess hiho is a single map/reduce job handling the MySQL-Hadoop
integration. Is it also possible to use it with other JDBC connectors too?

Best Regards,
Utku

On Fri, Mar 19, 2010 at 5:07 AM, Sonal Goyal sonalgoy...@gmail.com wrote:

 Hi Utku,

 If MySQL is your target database, you may check Meghsoft's hiho:

 http://code.google.com/p/hiho/

 The current release supports transferring data from Hadoop to MySQL. We will
 be releasing the MySQL-to-Hadoop transfer functionality soon, sometime next
 week.

 Thanks and Regards,
 Sonal
 www.meghsoft.com


 On Thu, Mar 18, 2010 at 5:31 AM, Aaron Kimball aa...@cloudera.com wrote:

  Hi Utku,
 
  Apache Hadoop 0.20 cannot support Sqoop as-is. Sqoop makes use of
  DataDrivenDBInputFormat (among other APIs) that are not shipped with
  Apache's 0.20 release. In order to get Sqoop working on 0.20, you'd need
  to apply a lengthy list of patches from the project source repository to
  your copy of Hadoop and recompile. Or you could just download it all from
  Cloudera, where we've done that work for you :)
 
  So as it stands, Sqoop won't be able to run on 0.20 unless you choose to
  use Cloudera's distribution. Do note that your use of the term "fork" is a
  bit strong here; with the exception of (minor) modifications to make it
  interact in a more compatible manner with the external Linux environment,
  our distribution only includes code that's available to the project at
  large. But some of that code has not been rolled into a binary release
  from Apache yet. If you choose to go with Cloudera's distribution, it just
  means that you get publicly available features (like Sqoop, MRUnit, etc.)
  a year or so ahead of what Apache has formally released, but our codebase
  isn't radically diverging; CDH is just somewhere ahead of the Apache 0.20
  release, but behind Apache's svn trunk. (All of Sqoop, MRUnit, etc. are
  available in the Hadoop source repository on the trunk branch.)
 
  If you install our distribution, then Sqoop will be installed in
  /usr/lib/hadoop-0.20/contrib/sqoop and /usr/bin/sqoop for you. There isn't
  a separate package to install Sqoop independent of the rest of CDH; thus
  no extra download link on our site.
 
  I hope this helps!
 
  Good luck,
  - Aaron
 
 
  On Wed, Mar 17, 2010 at 4:30 AM, Reik Schatz reik.sch...@bwin.org
 wrote:
 
   At least for MRUnit, I was not able to find it outside of the Cloudera
   distribution (CDH). What I did: installed CDH locally using apt (Ubuntu),
   searched for and copied the mrunit library into my local Maven repository,
   and removed CDH afterwards. I guess the same is somehow possible for Sqoop.
  
   /Reik
  
  
   Utku Can Topçu wrote:
  
   Dear All,
  
    I'm trying to run tests using MySQL as a data source, so I thought
    Cloudera's Sqoop would be a nice project to have in production.
    However, I'm not using Cloudera's Hadoop distribution right now, and
    I'm not actually thinking of switching from the main project to a fork.
  
    I read the documentation on Sqoop at
    http://www.cloudera.com/developers/downloads/sqoop/ but there are
    actually no links for downloading Sqoop itself.
  
    Has anyone here tried to use Sqoop with the latest Apache Hadoop?
    If so, can you give me some tips and tricks?
  
   Best Regards,
   Utku
  
  
  
   --
  
    Reik Schatz
   Technical Lead, Platform
   P: +46 8 562 470 00
   M: +46 76 25 29 872
   F: +46 8 562 470 01
    E: reik.sch...@bwin.org
    bwin Games AB
   Klarabergsviadukten 82,
   111 64 Stockholm, Sweden
  
  
  
 



Re: Why must I wait for NameNode?

2010-03-19 Thread Todd Lipcon
There's a bit of an issue if you have no data in your HDFS -- 0 blocks out
of 0 is considered 100% reported, so NN leaves safe mode even if there are
no DNs talking to it yet.

For a fix, please see HDFS-528, included in Cloudera's CDH2.

Thanks
-Todd



On Fri, Mar 19, 2010 at 10:29 AM, Bill Habermaas b...@habermaas.us wrote:

 At startup, the namenode goes into 'safe' mode to wait for all data nodes to
 send block reports on data they are holding.  This is normal for hadoop and
 necessary to make sure all replicated data is accounted for across the
 cluster.  It is the nature of the beast to work this way for good reasons.

 Bill

 -Original Message-
 From: Nick Klosterman [mailto:nklos...@ecn.purdue.edu]
 Sent: Friday, March 19, 2010 1:21 PM
 To: common-user@hadoop.apache.org
 Subject: Why must I wait for NameNode?

 What is the namenode doing upon startup? I have to wait about 1 minute
 and watch for the namenode dfs usage to drop from 100%; otherwise the install
 is unusable. Is this typical? Is something wrong with my install?

 I've been attempting the pseudo-distributed tutorial example for a
 while, trying to get it to work.  I finally discovered that the namenode
 upon startup is 100% in use and I need to wait about 1 minute before I
 can use it. Is this typical of hadoop installations?

 This isn't entirely clear in the tutorial.  I believe that a note should
 be added if this is typical.  This error caused me to get "WARN
 org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: SOMEFILE could
 only be replicated to 0 nodes, instead of 1".

 I had written a script to do all of the steps right in a row.  Now with a
 1 minute wait things work. Is my install atypical, or am I doing something
 wrong that is causing this needed wait time?

 Thanks,
 Nick





-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Why must I wait for NameNode?

2010-03-19 Thread Ravi Phulari

If you don't want to wait, you can run:

 bin/hadoop dfsadmin -safemode leave

And this might be useful for reference.

-safemode enter|leave|get|wait:  Safe mode maintenance command.
Safe mode is a Namenode state in which it
1.  does not accept changes to the name space (read-only)
2.  does not replicate or delete blocks.
Safe mode is entered automatically at Namenode startup, and
leaves safe mode automatically when the configured minimum
percentage of blocks satisfies the minimum replication
condition.  Safe mode can also be entered manually, but then
it can only be turned off manually as well.
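
If you need this from a script or driver program, "-safemode wait" blocks
until the namenode leaves safe mode. A rough programmatic equivalent against
the 0.20 client API (untested sketch; the class name is just a placeholder
and these HDFS classes may differ in other versions) would be:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.FSConstants;

public class WaitForSafeModeExit {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        if (fs instanceof DistributedFileSystem) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            // SAFEMODE_GET only queries the current state; it returns true
            // while the namenode is still in safe mode.
            while (dfs.setSafeMode(FSConstants.SafeModeAction.SAFEMODE_GET)) {
                Thread.sleep(1000);
            }
        }
        System.out.println("Namenode has left safe mode.");
    }
}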

Ravi
Hadoop @ Yahoo!

On 3/19/10 10:29 AM, Bill Habermaas b...@habermaas.us wrote:

At startup, the namenode goes into 'safe' mode to wait for all data nodes to
send block reports on data they are holding.  This is normal for hadoop and
necessary to make sure all replicated data is accounted for across the
cluster.  It is the nature of the beast to work this way for good reasons.

Bill

-Original Message-
From: Nick Klosterman [mailto:nklos...@ecn.purdue.edu]
Sent: Friday, March 19, 2010 1:21 PM
To: common-user@hadoop.apache.org
Subject: Why must I wait for NameNode?

What is the namenode doing upon startup? I have to wait about 1 minute
and watch for the namenode dfs usage to drop from 100%; otherwise the install
is unusable. Is this typical? Is something wrong with my install?

I've been attempting the pseudo-distributed tutorial example for a
while, trying to get it to work.  I finally discovered that the namenode
upon startup is 100% in use and I need to wait about 1 minute before I
can use it. Is this typical of hadoop installations?

This isn't entirely clear in the tutorial.  I believe that a note should
be added if this is typical.  This error caused me to get "WARN
org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: SOMEFILE could
only be replicated to 0 nodes, instead of 1".

I had written a script to do all of the steps right in a row.  Now with a
1 minute wait things work. Is my install atypical, or am I doing something
wrong that is causing this needed wait time?

Thanks,
Nick




Ravi
--



Re: (Strange!)getFileSystem in JVM shutdown hook throws shutdown in progress exception

2010-03-19 Thread Ted Yu
I have logged a comment in
https://issues.apache.org/jira/browse/HADOOP-4829 which is related to the
IllegalStateException that I saw when Cache.remove() tried to remove the
shutdown hook while the JVM was shutting down.

Cheers

On Wed, Mar 10, 2010 at 11:00 AM, Todd Lipcon t...@cloudera.com wrote:

 Hi,

 The issue here is that Hadoop itself uses a shutdown hook to close all open
 filesystems when the JVM shuts down. Since JVM shutdown hooks don't have a
 specified order, you shouldn't access Hadoop filesystem objects from a
 shutdown hook.

 To get around this you can use the fs.automatic.close configuration
 variable (provided by this patch:
 https://issues.apache.org/jira/browse/HADOOP-4829) to disable the Hadoop
 shutdown hook. This patch is applied in CDH2; otherwise you'll have to
 apply it manually.

 Note that if you disable the shutdown hook, you'll need to manually close
 the filesystems using FileSystem.closeAll().
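
 For example, a minimal sketch of that pattern (untested; assumes the
 HADOOP-4829 patch is in your build, and the class name is just a
 placeholder):

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;

 public class ManualFsClose {
     public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         // Disable Hadoop's own shutdown hook (requires HADOOP-4829).
         conf.setBoolean("fs.automatic.close", false);
         FileSystem fs = FileSystem.get(conf);
         // ... do work with fs ...
         // With the automatic hook disabled, close all cached filesystems
         // yourself before (or during) your own shutdown logic.
         FileSystem.closeAll();
     }
 }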

 Thanks
 -Todd

 On Tue, Mar 9, 2010 at 9:39 PM, Silence wil...@yahoo.cn wrote:

 
  Hi fellows
  The code segment below adds a shutdown hook to the JVM, but I got a
  strange exception:
  java.lang.IllegalStateException: Shutdown in progress
 at
  java.lang.ApplicationShutdownHooks.add(ApplicationShutdownHooks.java:39)
 at java.lang.Runtime.addShutdownHook(Runtime.java:192)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1387)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:191)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:180)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
 at young.Main$1.run(Main.java:21)
  The Javadoc says this exception is thrown when the virtual machine is
  already in the process of shutting down
  (http://java.sun.com/j2se/1.5.0/docs/api/). What does this mean? Why does
  this happen? How do I fix it?
  I'd really appreciate it if you could try this code and help me figure out
  what's going on here. Thank you!
 
 
 ---
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobConf;
 
  @SuppressWarnings("deprecation")
  public class Main {

      public static void main(String[] args) {
          Runtime.getRuntime().addShutdownHook(new Thread() {
              @Override
              public void run() {
                  Path path = new Path("/temp/hadoop-young");
                  System.out.println("Thread run : " + path);
                  Configuration conf = new JobConf();
                  FileSystem fs;
                  try {
                      // This call fails with IllegalStateException, because
                      // FileSystem's cache tries to register its own shutdown
                      // hook while the JVM is already shutting down.
                      fs = path.getFileSystem(conf);
                      if (fs.exists(path)) {
                          fs.delete(path);
                      }
                  } catch (Exception e) {
                      System.err.println(e.getMessage());
                      e.printStackTrace();
                  }
              }
          });
      }
  }
  --
  View this message in context:
 
 http://old.nabble.com/%28Strange%21%29getFileSystem-in-JVM-shutdown-hook-throws-shutdown-in-progress-exception-tp27845803p27845803.html
  Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 


 --
 Todd Lipcon
 Software Engineer, Cloudera



Re: performance analysis?

2010-03-19 Thread jiang licht
Thanks, Ninad. This really helps.

Best regards,

Michael

--- On Fri, 3/19/10, Ninad Raut hbase.user.ni...@gmail.com wrote:

From: Ninad Raut hbase.user.ni...@gmail.com
Subject: Re: performance analysis?
To: common-user@hadoop.apache.org
Date: Friday, March 19, 2010, 12:02 AM

The best and easiest-to-configure tool is Ganglia. Hadoop has built-in
support for Ganglia. Check out the YDN Ganglia setup steps and you will be
able to monitor your CPU and MapReduce jobs as well.

To monitor network-related aspects you can check out Nagios.
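
On the Hadoop side, the relevant piece for Ganglia is
conf/hadoop-metrics.properties on each node. A minimal sketch (the gmond
host below is a placeholder, and the context class may differ depending on
your Ganglia version, so double-check against the setup guide) looks
something like:

dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=gmond-host.example.com:8649
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=gmond-host.example.com:8649
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
jvm.period=10
jvm.servers=gmond-host.example.com:8649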

Regards,
Ninad R

On Fri, Mar 19, 2010 at 3:39 AM, jiang licht licht_ji...@yahoo.com wrote:

 To find the bottleneck, I tried to figure out whether some processes/threads
 are often blocked waiting for either disk or network I/O, and why, when a
 mapper or reducer runs slow. In my case, on each slave, up to 12 mappers are
 allowed to run simultaneously. The CPUs are more than 90% of the time in idle
 mode and at most about 2% in iowait. But I found that most mappers (from top
 and jps) were sleeping, and strace shows that they (including the tasktracker
 and datanode) were blocked on futex(0x4035b9d0, FUTEX_WAIT, 12566, NULL,

 Here's a list of accumulated open files (including network, pipe, socket,
 etc.) of the datanode, grouped by type:

 IPv6 15
 unix 1
 DIR 2
 CHR 4
  17
 REG 122
 sock 1
 FIFO 34

 Here's a list of accumulated open files (including network, pipe,
 socket, etc.) of the tasktracker, grouped by type:

 IPv6 24
 unix 1
 DIR 2
 CHR 4
  4
 REG 105
 sock 1
 FIFO 50

 Here's a typical mapper thread:

 IPv6 2
 unix 1
  1
 DIR 4
 sock 1
 FIFO 2
 CHR 6
 REG 106

 A mapper would block on futex for about a minute or so. It seems to me that
 various I/O cannot keep up with the CPU. Would it be helpful to increase some
 buffer parameters to handle this? Or do these stats imply something else?
 BTW, what is an effective way to analyze the performance of a Hadoop cluster,
 and what are good tools? Any recommendations?

 Thanks,

 Michael






  

Re: java.lang.NullPointerException at org.apache.hadoop.mapred.IFile$Writer.<init>(IFile.java:102)

2010-03-19 Thread jiang licht
Thanks, Amogh.

Best regards,

Michael

--- On Thu, 3/18/10, Amogh Vasekar am...@yahoo-inc.com wrote:

From: Amogh Vasekar am...@yahoo-inc.com
Subject: Re: java.lang.NullPointerException at
org.apache.hadoop.mapred.IFile$Writer.<init>(IFile.java:102)
To: common-user@hadoop.apache.org common-user@hadoop.apache.org
Date: Thursday, March 18, 2010, 11:34 PM

Hi,
http://hadoop.apache.org/common/docs/current/native_libraries.html
Should answer your questions.

Amogh


On 3/18/10 10:48 PM, jiang licht licht_ji...@yahoo.com wrote:

I got the following error when I tried to do gzip compression on map output, 
using hadoop-0.20.1.

settings in mapred-site.xml--
mapred.compress.map.output=true
mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
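
For reference, I believe the equivalent per-job settings via the (deprecated)
JobConf API would be something like the sketch below (untested; the class
name is just a placeholder):

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

public class GzipMapOutputConf {
    // Same effect as the mapred-site.xml settings above, applied per job.
    public static void enableGzipMapOutput(JobConf conf) {
        conf.setCompressMapOutput(true);
        conf.setMapOutputCompressorClass(GzipCodec.class);
    }
}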

error message--
java.lang.NullPointerException
        at org.apache.hadoop.mapred.IFile$Writer.<init>(IFile.java:102)
        at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1198)
        at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1091)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)


I read the source and saw that the Writer in IFile takes care of map output
compression. So it seems to me that I either didn't have the gzip native
library built or didn't have the correct settings. There is no build folder
in HADOOP_HOME and no native directory in the lib folder in HADOOP_HOME. I
checked that I have gzip and zlib installed. So the next step is to build the
Hadoop native library on top of these. How do I do that? Is it a simple matter
of pointing some variable to the gzip or zlib libs, or should I use build.xml
in Hadoop to build some target? If so, what target should I build?

 Thanks,

Michael