Re: newbie install
Turns out, it does cause problems later on. I think the problem is that the slaves have, in their hosts files:

127.0.0.1 localhost.localdomain localhost
127.0.0.1 machinename.cse.sc.edu machinename

The reduce phase fails because the reducer cannot get data from the mappers as it tries to open a connection to "http://localhost:". This is kinda annoying as all the hostnames resolve properly using DNS. I think it qualifies as a hadoop bug, or maybe not.

Jose

On Wed, Jul 23, 2008 at 10:19 AM, Edward J. Yoon <[EMAIL PROTECTED]> wrote:
> That's good. :)
>
>> Will this cause bigger problems later on? or should I just ignore it.
>
> I'm not sure, but I guess there is no problem.
> Does anyone have some experience with that?
>
> Regards, Edward J. Yoon
>
> On Wed, Jul 23, 2008 at 11:05 PM, Jose Vidal <[EMAIL PROTECTED]> wrote:
>> Thanks! That worked. I was able to run dfs and put some files in it.
>>
>> However, when I go to my namenode at http://namenode:50070 I see that
>> all the datanodes have a name of "localhost".
>>
>> Will this cause bigger problems later on? or should I just ignore it.
>>
>> Jose
>>
>> On Tue, Jul 22, 2008 at 6:48 PM, Edward J. Yoon <[EMAIL PROTECTED]> wrote:

So, do I need to change the host file in all the slaves, or just the namenode?

>>>
>>> Just the namenode.
>>>
>>> Thanks, Edward
>>>
>>> On Wed, Jul 23, 2008 at 7:45 AM, Jose Vidal <[EMAIL PROTECTED]> wrote:

Yes, the host file just has:

127.0.0.1 localhost hermes.cse.sc.edu hermes

So, do I need to change the host file in all the slaves, or just the namenode? I'm not root on these machines so changing these requires gentle handling of our sysadmin.

Jose

On Tue, Jul 22, 2008 at 5:37 PM, Edward J. Yoon <[EMAIL PROTECTED]> wrote:
> If you have a static address for the machine, make sure that your
> hosts file is pointing to the static address for the namenode host
> name as opposed to the 127.0.0.1 address. It should look something
> like this with the values replaced with your values.
> 127.0.0.1 localhost.localdomain localhost
> 192.x.x.x yourhost.yourdomain.com yourhost
>
> - Edward
>
> On Wed, Jul 23, 2008 at 6:03 AM, Jose Vidal <[EMAIL PROTECTED]> wrote:
>> I'm trying to install hadoop on our linux machine but after
>> start-all.sh none of the slaves can connect:
>>
>> 2008-07-22 16:35:27,534 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
>> /************************************************************
>> STARTUP_MSG: Starting DataNode
>> STARTUP_MSG:   host = thetis/127.0.0.1
>> STARTUP_MSG:   args = []
>> STARTUP_MSG:   version = 0.16.4
>> STARTUP_MSG:   build = http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.16 -r 652614; compiled by 'hadoopqa' on Fri May 2 00:18:12 UTC 2008
>> ************************************************************/
>> 2008-07-22 16:35:27,643 WARN org.apache.hadoop.dfs.DataNode: Invalid directory in dfs.data.dir: directory is not writable: /work
>> 2008-07-22 16:35:27,699 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hermes.cse.sc.edu/129.252.130.148:9000. Already tried 1 time(s).
>> 2008-07-22 16:35:28,700 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hermes.cse.sc.edu/129.252.130.148:9000. Already tried 2 time(s).
>> 2008-07-22 16:35:29,700 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hermes.cse.sc.edu/129.252.130.148:9000. Already tried 3 time(s).
>> 2008-07-22 16:35:30,701 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hermes.cse.sc.edu/129.252.130.148:9000. Already tried 4 time(s).
>> 2008-07-22 16:35:31,702 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hermes.cse.sc.edu/129.252.130.148:9000. Already tried 5 time(s).
>> 2008-07-22 16:35:32,702 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hermes.cse.sc.edu/129.252.130.148:9000. Already tried 6 time(s).
>>
>> same for the tasktrackers (port 9001).
>>
>> I think the problem has something to do with name resolution.
>> Check these out:
>>
>> [EMAIL PROTECTED]:~/hadoop-0.16.4> telnet hermes.cse.sc.edu 9000
>> Trying 127.0.0.1...
>> Connected to hermes.cse.sc.edu (127.0.0.1).
>> Escape character is '^]'.
>> bye
>> Connection closed by foreign host.
>>
>> [EMAIL PROTECTED]:~/hadoop-0.16.4> host hermes.cse.sc.edu
>> hermes.cse.sc.edu has address 129.252.130.148
>>
>> [EMAIL PROTECTED]:~/hadoop-0.16.4> telnet 129.252.130.148 9000
>> Trying 129.252.130.148...
>> telnet: connect to address 129.252.130.148: Connection refused
>> telnet: Unable to connect to remote host: Connection refused
>>
>> So, the f
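The hosts-file behaviour Jose describes at the top of this thread can be sketched in a few lines. This is a toy model of the resolver, not Hadoop code; the entries are the ones quoted from the slaves' hosts files. The point is that the hosts file is consulted before DNS, and the first matching line wins:

```python
# Simplified model of /etc/hosts lookup, using the entries Jose
# describes on his slave nodes. The resolver consults the hosts file
# before DNS, so the first matching line wins - which is why the
# reducers end up fetching from "http://localhost:" even though
# DNS resolves the hostnames correctly.
HOSTS = """\
127.0.0.1  localhost.localdomain  localhost
127.0.0.1  machinename.cse.sc.edu machinename
"""

def resolve(name, hosts=HOSTS):
    # The first line whose names include the query wins, as in /etc/hosts.
    for line in hosts.splitlines():
        fields = line.split()
        if fields and name in fields[1:]:
            return fields[0]
    return None  # not in the hosts file; a real resolver would try DNS

print(resolve("machinename.cse.sc.edu"))  # → 127.0.0.1
```

This is why the fix discussed in the thread is to map the machine's own name to its static address rather than to 127.0.0.1.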
Re: Bean Scripting Framework?
On Jul 25, 2008, at 3:53 PM, Joydeep Sen Sarma wrote:

Just as an aside - there is probably a general perception that streaming is really slow (at least I had it). The last time I did some profiling (in 0.15), the primary overheads from streaming came from the scripting language (python is ssslow). For an insanely fast script (bin/cat), I saw significant overheads in the java function/data path that drowned out the streaming overheads by a huge margin (a lot of those overheads have been fixed in recent versions - thanks to the hadoop team). Writing a c/c++ streaming program is a pretty good way of getting good performance (and some performance-sensitive apps in our environment ended up doing just that). Agreed that not all hooks are available.

Hadoop Pipes?

Arun

-----Original Message-----
From: James Moore [mailto:[EMAIL PROTECTED]]
Sent: Friday, July 25, 2008 6:18 AM
To: core-user@hadoop.apache.org
Subject: Re: Bean Scripting Framework?

On Thu, Jul 24, 2008 at 10:48 PM, Venkat Seeth <[EMAIL PROTECTED]> wrote:
> Why don't you use hadoop streaming?

I think that's more of a broader question - why doesn't everyone use streaming?

There's no real difference between doing Hadoop in Ruby/Scala/Java/Jython/whatever - these days, Java is just one of many languages that run on the JVM. There's no language-specific reason to pick streaming over a native implementation if you're working in a language that has a JVM implementation. I'm working on a Ruby interface just because I think there's a space for a nice DSL for setting up Hadoop and running tasks that's more pleasant for people used to writing Ruby than the current idioms.

Streaming is great for things that don't run on a JVM - Erlang, Haskell, Smalltalk, etc.

If you're streaming, though, you lose all the flexibility of Hadoop. You get line-oriented text in and out, and that's about it. But if you want all the Hadoop features, you're going to want to go native, be it in Ruby, Scala, Java, or whatever your language of choice is.
Streaming is powerful, and huge numbers of solutions of the form "my_code < data > output" have solved many, many problems over the years. If your problem fits in the streaming space, then you should consider it. And I think that's a language-neutral statement - just because your solution is in Java doesn't mean you should bother hooking it up into a native Hadoop app.

--
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com
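The "line-oriented text in and out" contract James describes is concrete enough to sketch: a streaming mapper reads records as lines on stdin and writes key<TAB>value lines to stdout, and that is the whole interface. A minimal word-count mapper as an illustration (not from the thread):

```python
import sys

# A minimal Hadoop Streaming mapper: input records arrive one per line
# on stdin, and every stdout line is interpreted as key<TAB>value.
# This narrow contract is both streaming's portability and the
# "you get line-oriented text and that's about it" limitation.
def map_line(line):
    """Emit (word, 1) pairs for a single input line."""
    return [(word, 1) for word in line.split()]

def run(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        for key, count in map_line(line):
            stdout.write("%s\t%d\n" % (key, count))

if __name__ == "__main__":
    run()
```

The same file works as the `-mapper` of a streaming job or in a plain shell pipeline, which is what makes streaming language-neutral.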
RE: Bean Scripting Framework?
Just as an aside - there is probably a general perception that streaming is really slow (at least I had it). The last time I did some profiling (in 0.15), the primary overheads from streaming came from the scripting language (python is ssslow). For an insanely fast script (bin/cat), I saw significant overheads in the java function/data path that drowned out the streaming overheads by a huge margin (a lot of those overheads have been fixed in recent versions - thanks to the hadoop team). Writing a c/c++ streaming program is a pretty good way of getting good performance (and some performance-sensitive apps in our environment ended up doing just that). Agreed that not all hooks are available.

-----Original Message-----
From: James Moore [mailto:[EMAIL PROTECTED]]
Sent: Friday, July 25, 2008 6:18 AM
To: core-user@hadoop.apache.org
Subject: Re: Bean Scripting Framework?

On Thu, Jul 24, 2008 at 10:48 PM, Venkat Seeth <[EMAIL PROTECTED]> wrote:
> Why don't you use hadoop streaming?

I think that's more of a broader question - why doesn't everyone use streaming?

There's no real difference between doing Hadoop in Ruby/Scala/Java/Jython/whatever - these days, Java is just one of many languages that run on the JVM. There's no language-specific reason to pick streaming over a native implementation if you're working in a language that has a JVM implementation. I'm working on a Ruby interface just because I think there's a space for a nice DSL for setting up Hadoop and running tasks that's more pleasant for people used to writing Ruby than the current idioms.

Streaming is great for things that don't run on a JVM - Erlang, Haskell, Smalltalk, etc.

If you're streaming, though, you lose all the flexibility of Hadoop. You get line-oriented text in and out, and that's about it. But if you want all the Hadoop features, you're going to want to go native, be it in Ruby, Scala, Java, or whatever your language of choice is.
Streaming is powerful, and huge numbers of solutions of the form "my_code < data > output" have solved many, many problems over the years. If your problem fits in the streaming space, then you should consider it. And I think that's a language-neutral statement - just because your solution is in Java doesn't mean you should bother hooking it up into a native Hadoop app.

--
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com
Re: Bean Scripting Framework?
On Friday 25 July 2008 15:18:24 James Moore wrote:
> On Thu, Jul 24, 2008 at 10:48 PM, Venkat Seeth <[EMAIL PROTECTED]> wrote:
>> Why don't you use hadoop streaming?
>
> I think that's more of a broader question - why doesn't everyone use
> streaming?
>
> There's no real difference between doing Hadoop in
> Ruby/Scala/Java/Jython/whatever - these days, Java is just one of many
> languages that run on the JVM. There's no language-specific reason
> to pick streaming over a native implementation if you're working in a
> language that has a JVM implementation. I'm working on a Ruby
> interface just because I think there's a space for a nice DSL for
> setting up Hadoop and running tasks that's more pleasant for people
> used to writing Ruby than the current idioms.

Well, there are reasons to go for streaming. It's an acceptable interface. For many non-Java developers it's by far a nicer interface than trying to use the Java APIs, which are by scripting language standards clunky at best.

For a comparison, I managed to write a tar implementation that reads/writes to S3 in cpython/boto in less than a day. hdfstar, which is a Jython script, took me a week. Some time was clearly spent debugging the Java/Jython setup and fixing tarfile.py, but most time was spent dealing with the Java APIs to read/write HDFS files. Note please that I did not know S3/boto before either. And note that I can read Java quite fine. (Actually, I can even write it quite fine, if forced by an employer :-P )

So if streaming.jar is enough for a given use case, use it. As a side benefit you get a program that you can use differently, without hadoop.

Andreas

> Streaming is great for things that don't run on a JVM - Erlang,
> Haskell, Smalltalk, etc.
>
> If you're streaming, though, you lose all the flexibility of Hadoop.
> You get line-oriented text in and out, and that's about it.
> But if you want all the Hadoop features, you're going to want to go native,
> be it in Ruby, Scala, Java, or whatever your language of choice is.
>
> Streaming is powerful, and huge numbers of solutions of the form
> "my_code < data > output" have solved many, many problems over the
> years. If your problem fits in the streaming space, then you should
> consider it. And I think that's a language-neutral statement - just
> because your solution is in Java doesn't mean you should bother
> hooking it up into a native Hadoop app.
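Andreas's side benefit (a streaming program you can run without Hadoop) is easy to demonstrate: the mapper and reducer are plain filters, and an ordinary sort stands in for the shuffle. The functions below are an illustrative sketch, not the Hadoop API:

```python
# A streaming job run locally with no Hadoop at all: two filters plus
# a sort. mapper emits (key, value) pairs; sorting them groups equal
# keys together, which is exactly what the shuffle phase guarantees;
# reducer then aggregates per key.
def mapper(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reducer(sorted_pairs):
    counts = {}
    for word, n in sorted_pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

data = ["the quick fox", "the lazy dog"]
shuffled = sorted(mapper(data))  # a plain sort stands in for the shuffle
print(reducer(shuffled))
```

The shell equivalent is the familiar `mapper | sort | reducer` pipeline, which is why a streaming job is also a reusable Unix program.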
Re: Bean Scripting Framework?
This is a bit scattered but I wanted to post this in case it might help someone... Here's a little more detail on the loading problems I've been having.

For now, I'm just trying to call some ruby from the reduce method of my map/reduce job. I want to move to a more general setup, like the one James Moore proposes above, but I'm taking baby steps due to my general lack of knowledge regarding hadoop and jruby.

The first problem I encountered was that, from within hadoop, I was unable to load the scripting framework (JSR223) at all. I was getting this exception (using JRubyScriptEngineManager):

Exception in thread "main" java.lang.NullPointerException
at org.jruby.runtime.load.LoadService.findFile(LoadService.java:476)
at org.jruby.runtime.load.LoadService.findLibrary(LoadService.java:394)
at org.jruby.runtime.load.LoadService.smartLoad(LoadService.java:259)
at org.jruby.runtime.load.LoadService.require(LoadService.java:349)
at com.sun.script.jruby.JRubyScriptEngine.init(JRubyScriptEngine.java:484)
at com.sun.script.jruby.JRubyScriptEngine.<init>(JRubyScriptEngine.java:96)
at com.sun.script.jruby.JRubyScriptEngineFactory.getScriptEngine(JRubyScriptEngineFactory.java:134)
at com.sun.script.jruby.JRubyScriptEngineManager.registerEngineNames(JRubyScriptEngineManager.java:95)
at com.sun.script.jruby.JRubyScriptEngineManager.init(JRubyScriptEngineManager.java:72)
at com.sun.script.jruby.JRubyScriptEngineManager.<init>(JRubyScriptEngineManager.java:66)
at com.sun.script.jruby.JRubyScriptEngineManager.<init>(JRubyScriptEngineManager.java:61)
at com.talentspring.TestMapreduce.dump(TestMapreduce.java:236)
at com.talentspring.TestMapreduce.main(TestMapreduce.java:432)

Poking around the JRubyScriptEngine source (https://scripting.dev.java.net/source/browse/scripting/engines/jruby/src/com/sun/script/jruby/) it looks like it uses the property "com.sun.script.jruby.loadpath" and not "jruby.home" as suggested by http://wiki.jruby.org/wiki/Java_Integration#Java_6_.28using_JSR_223:_Scripting.29 . Hmmm.
I added -Dcom.sun.script.jruby.loadpath=$JRUBY_HOME to my invocation and it worked... sort of. I found that by the time execution reached the 'configure' method, the load path property was null. Odd. Does anybody know why this might be?

In any case, I saved the value in my JobConf before submitting the job, like so:

jobConf.set("jruby.load_path", System.getProperty("com.sun.script.jruby.loadpath"));

Then, in the configure method I have:

System.setProperty("com.sun.script.jruby.loadpath", jobConf.get("jruby.load_path"));

I then load the script engine and everything works... So: does anybody have any idea of why I might be losing the system load path property when I get to the configure method?

Cheers,
-lincoln

--
lincolnritter.com

On Fri, Jul 25, 2008 at 10:22 AM, Lincoln Ritter <[EMAIL PROTECTED]> wrote:
> I was using BSF to avoid java 6 issues. However I'm having similar
> issues using both systems. Basically, I can't load the scripting
> engine from within hadoop. I have successfully compiled and run some
> stand-alone test examples but am having trouble getting anything to
> work from hadoop. One confounding factor is that my development
> machine is OS X 10.5 with the stock 1.5 JDK. On the surface this
> doesn't seem to be a problem given the success I've had at creating
> small stand-alone tests... I run the stand-alone stuff with exactly
> the same classpath and environment so it seems that something weird is
> going on. Additionally, as a sanity check, I've tried loading the
> javascript engine and that does work from within hadoop.
>
> All the JSR jars are on the classpath and I'm kicking off the hadoop
> process using the -Djruby.home=... option. Did you have to do
> anything special here?
>
> -lincoln
>
> --
> lincolnritter.com
>
> On Thu, Jul 24, 2008 at 7:00 PM, James Moore <[EMAIL PROTECTED]> wrote:
>> On Thu, Jul 24, 2008 at 3:51 PM, Lincoln Ritter
>> <[EMAIL PROTECTED]> wrote:
>>> Well that sounds awesome!
>>> It would be simply splendid to see what
>>> you've got if you're willing to share.
>>
>> I'll be happy to share, but it's pretty much in pieces, not ready for
>> release. I'll put it out with whatever license Hadoop itself uses
>> (presumably Apache).
>>
>>> Are you going the 'direct' embedding route or using a scripting
>>> framework (BSF or javax.script)?
>>
>> JSR223 is the way to go according to the JRuby guys at RailsConf last
>> month. It's pretty straightforward - see
>> http://wiki.jruby.org/wiki/Java_Integration#Java_6_.28using_JSR_223:_Scripting.29
>>
>> --
>> James Moore | [EMAIL PROTECTED]
>> Ruby and Ruby on Rails consulting
>> blog.restphone.com
Is there a network communication counter for mapred?
Hi,

Besides knowing the "data-local" and "rack-local" map task numbers, I am interested in the size of data transferred over the network, e.g., the size of intermediate map output transferred (not handled locally). I wonder if there is such a counter.

Thank you.

Best,
-Kevin
Re: Bean Scripting Framework?
I was using BSF to avoid java 6 issues. However I'm having similar issues using both systems. Basically, I can't load the scripting engine from within hadoop. I have successfully compiled and run some stand-alone test examples but am having trouble getting anything to work from hadoop. One confounding factor is that my development machine is OS X 10.5 with the stock 1.5 JDK. On the surface this doesn't seem to be a problem given the success I've had at creating small stand-alone tests... I run the stand-alone stuff with exactly the same classpath and environment so it seems that something weird is going on. Additionally, as a sanity check, I've tried loading the javascript engine and that does work from within hadoop.

All the JSR jars are on the classpath and I'm kicking off the hadoop process using the -Djruby.home=... option. Did you have to do anything special here?

-lincoln

--
lincolnritter.com

On Thu, Jul 24, 2008 at 7:00 PM, James Moore <[EMAIL PROTECTED]> wrote:
> On Thu, Jul 24, 2008 at 3:51 PM, Lincoln Ritter
> <[EMAIL PROTECTED]> wrote:
>> Well that sounds awesome! It would be simply splendid to see what
>> you've got if you're willing to share.
>
> I'll be happy to share, but it's pretty much in pieces, not ready for
> release. I'll put it out with whatever license Hadoop itself uses
> (presumably Apache).
>
>> Are you going the 'direct' embedding route or using a scripting
>> framework (BSF or javax.script)?
>
> JSR223 is the way to go according to the JRuby guys at RailsConf last
> month. It's pretty straightforward - see
> http://wiki.jruby.org/wiki/Java_Integration#Java_6_.28using_JSR_223:_Scripting.29
>
> --
> James Moore | [EMAIL PROTECTED]
> Ruby and Ruby on Rails consulting
> blog.restphone.com
Re: Using MapReduce to do table comparing.
On Thu, Jul 24, 2008 at 8:03 AM, Amber <[EMAIL PROTECTED]> wrote:
> Yes, I think this is the simplest method, but there are problems too:
>
> 1. The reduce stage wouldn't begin until the map stage ends, by which time
> we will have done two full table scans, and the comparing will take almost
> the same time, because about 90% of intermediate <key, value> pairs will
> have two values and different keys. If I could specify a number n such that
> the reduce tasks start once there are n intermediate <key, value> pairs
> with the same key, that would be better. In my case I would set the magic
> number to 2.

I don't think I understood this completely, but I'll try to respond.

First, I think you're going to be doing something like two full table scans in any case. Whether it's in an RDBMS or in Hadoop, you need to read the complete dataset for both day1 and day2. (Or at least that's how I interpreted your original mail - you're not trying to keep deltas over N days, just doing a delta for yesterday/today from scratch every time.) You could possibly speed this up by keeping some kind of parsed data in hadoop for previous days, rather than just text, but I wouldn't do this as my first solution.

It seems like starting the reducers before the maps are done isn't going to buy you anything. The same amount of total work needs to be done; when the work starts doesn't matter much. In this case, I'm guessing that you're going to have a setup where (total number of maps) == (total number of reducers) == 4 * (number of 4-core machines).

In any case, I'd say you should do some experiments with the most simple solution you can come up with. Your problem seems simple enough that just banging out some throwaway experimental code is going to a) not take very long, and b) tell you quite a bit about how your particular solution is going to behave in the real world.

> 2. I am not sure about how Hadoop stores intermediate <key, value> pairs;
> we would not be able to afford it as the data volume increases if it is
> kept in memory.
Hadoop is definitely prepared for very large numbers of intermediate key/value pairs - that's pretty much the normal case for hadoop jobs. It'll stream to/from disc as necessary.

Take a look at combiners as well - they may buy you something.

--
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com
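The day1/day2 diff James is describing can be sketched in plain Python (not the Hadoop API; the "row_id,payload" record format is made up for illustration). Map tags each record with its day; grouping by row key stands in for the shuffle; the reduce keeps rows whose payload differs between the two days:

```python
from itertools import groupby
from operator import itemgetter

# Sketch of the two-day table diff discussed above. map_record tags
# each record with its day; after grouping by row key (what the
# shuffle phase does), reduce_changed keeps rows whose payload
# differs between day1 and day2. Rows present on only one day also
# come out as changed, which covers inserts and deletes.
def map_record(day_tag, line):
    row_key, _, payload = line.partition(",")
    return (row_key, (day_tag, payload))

def reduce_changed(pairs):
    changed = []
    for row_key, group in groupby(sorted(pairs), key=itemgetter(0)):
        by_day = dict(tagged for _, tagged in group)
        if by_day.get("day1") != by_day.get("day2"):
            changed.append(row_key)
    return changed

pairs = [map_record("day1", "id1,old"), map_record("day2", "id1,new"),
         map_record("day1", "id2,same"), map_record("day2", "id2,same")]
print(reduce_changed(pairs))  # → ['id1']
```

A combiner doesn't help much here, since a row's two values must meet in the same reduce call before they can be compared.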
Re: Hadoop DFS
HBase is a project that uses DFS. If you want to know how to use DFS directly, the "bin/hadoop" script may be a good entry point. For example:

bin/hadoop dfs -cat ***

where *** is a file name in your DFS. Following this command, you can find out how to access DFS directly. Hope it helps you.

On 2008-7-25, at 4:09 AM, Wasim Bari wrote:

Hi,

I am new to Hadoop. Right now, I am only interested in working with Hadoop DFS. Can someone guide me where to start? Does anyone have information about an application that has already integrated Hadoop DFS? Any information regarding material about Hadoop DFS (case studies, articles, books, etc.) would be very nice.

Thanks,
Wasim
Re: Bean Scripting Framework?
On Thu, Jul 24, 2008 at 10:48 PM, Venkat Seeth <[EMAIL PROTECTED]> wrote:
> Why don't you use hadoop streaming?

I think that's more of a broader question - why doesn't everyone use streaming?

There's no real difference between doing Hadoop in Ruby/Scala/Java/Jython/whatever - these days, Java is just one of many languages that run on the JVM. There's no language-specific reason to pick streaming over a native implementation if you're working in a language that has a JVM implementation. I'm working on a Ruby interface just because I think there's a space for a nice DSL for setting up Hadoop and running tasks that's more pleasant for people used to writing Ruby than the current idioms.

Streaming is great for things that don't run on a JVM - Erlang, Haskell, Smalltalk, etc.

If you're streaming, though, you lose all the flexibility of Hadoop. You get line-oriented text in and out, and that's about it. But if you want all the Hadoop features, you're going to want to go native, be it in Ruby, Scala, Java, or whatever your language of choice is.

Streaming is powerful, and huge numbers of solutions of the form "my_code < data > output" have solved many, many problems over the years. If your problem fits in the streaming space, then you should consider it. And I think that's a language-neutral statement - just because your solution is in Java doesn't mean you should bother hooking it up into a native Hadoop app.

--
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com
Re: Bean Scripting Framework?
That sounds really interesting

On Jul 25, 2008, at 00:42, James Moore wrote:

Funny you should mention it - I'm working on a framework to do JRuby Hadoop this week. Something like:

class MyHadoopJob < Radoop
  input_format :text_input_format
  output_format :text_output_format
  map_output_key_class :text
  map_output_value_class :text

  def mapper(k, v, output, reporter)
    # ...
  end

  def reducer(k, vs, output, reporter)
  end
end

Plus a java glue file to call the Ruby stuff. And then it jars up the ruby files, the gem directory, and goes from there.

--
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com