Re: newbie install

2008-07-25 Thread Jose Vidal
Turns out, it does cause problems later on.

I think the problem is that the slaves have, in their hosts files:

127.0.0.1 localhost.localdomain localhost
127.0.0.1 machinename.cse.sc.edu machinename

The reduce phase fails because the reducer cannot get data from the
mappers, as it tries to open a connection to "http://localhost:" instead
of the mapper's real hostname.

This is kind of annoying, as all the hostnames resolve properly via
DNS. I think it qualifies as a Hadoop bug - or maybe not.
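For reference, the fix is for each slave's hosts file to map the machine's
name to its routable address rather than to loopback; something like this,
with 129.252.x.x standing in for the host's real IP:

127.0.0.1     localhost.localdomain localhost
129.252.x.x   machinename.cse.sc.edu machinename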

Jose

On Wed, Jul 23, 2008 at 10:19 AM, Edward J. Yoon <[EMAIL PROTECTED]> wrote:
> That's good. :)
>
>> Will this cause bigger problems later on? Or should I just ignore it?
>
> I'm not sure, but I guess there is no problem.
> Does anyone have some experience with that?
>
> Regards, Edward J. Yoon
>
> On Wed, Jul 23, 2008 at 11:05 PM, Jose Vidal <[EMAIL PROTECTED]> wrote:
>> Thanks! That worked. I was able to run DFS and put some files in it.
>>
>> However, when I go to my namenode at http://namenode:50070 I see that
>> all the datanodes have a name of "localhost".
>>
>> Will this cause bigger problems later on? Or should I just ignore it?
>>
>> Jose
>>
>> On Tue, Jul 22, 2008 at 6:48 PM, Edward J. Yoon <[EMAIL PROTECTED]> wrote:
 So, do I need to change the host file in all the slaves, or just the 
 namenode?
>>>
>>> Just the namenode.
>>>
>>> Thanks, Edward
>>>
>>> On Wed, Jul 23, 2008 at 7:45 AM, Jose Vidal <[EMAIL PROTECTED]> wrote:
 Yes, the host file just has:

 127.0.0.1 localhost hermes.cse.sc.edu hermes

 So, do I need to change the host file in all the slaves, or just the 
 namenode?

I'm not root on these machines, so changing them requires gentle
handling of our sysadmin.

 Jose

 On Tue, Jul 22, 2008 at 5:37 PM, Edward J. Yoon <[EMAIL PROTECTED]> wrote:
> If you have a static address for the machine, make sure that your
> hosts file is pointing to the static address for the namenode host
> name as opposed to the 127.0.0.1 address. It should look something
> like this with the values replaced with your values.
>
> 127.0.0.1   localhost.localdomain localhost
> 192.x.x.x   yourhost.yourdomain.com yourhost
>
> - Edward
>
> On Wed, Jul 23, 2008 at 6:03 AM, Jose Vidal <[EMAIL PROTECTED]> wrote:
>> I'm trying to install hadoop on our linux machine but after
>> start-all.sh none of the slaves can connect:
>>
>> 2008-07-22 16:35:27,534 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
>> /************************************************************
>> STARTUP_MSG: Starting DataNode
>> STARTUP_MSG:   host = thetis/127.0.0.1
>> STARTUP_MSG:   args = []
>> STARTUP_MSG:   version = 0.16.4
>> STARTUP_MSG:   build = http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.16 -r 652614; compiled by 'hadoopqa' on Fri May  2 00:18:12 UTC 2008
>> ************************************************************/
>> 2008-07-22 16:35:27,643 WARN org.apache.hadoop.dfs.DataNode: Invalid directory in dfs.data.dir: directory is not writable: /work
>> 2008-07-22 16:35:27,699 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hermes.cse.sc.edu/129.252.130.148:9000. Already tried 1 time(s).
>> 2008-07-22 16:35:28,700 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hermes.cse.sc.edu/129.252.130.148:9000. Already tried 2 time(s).
>> 2008-07-22 16:35:29,700 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hermes.cse.sc.edu/129.252.130.148:9000. Already tried 3 time(s).
>> 2008-07-22 16:35:30,701 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hermes.cse.sc.edu/129.252.130.148:9000. Already tried 4 time(s).
>> 2008-07-22 16:35:31,702 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hermes.cse.sc.edu/129.252.130.148:9000. Already tried 5 time(s).
>> 2008-07-22 16:35:32,702 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hermes.cse.sc.edu/129.252.130.148:9000. Already tried 6 time(s).
>>
>> same for the tasktrackers (port 9001).
>>
>> I think the problem has something to do with name resolution. Check 
>> these out:
>>
>> [EMAIL PROTECTED]:~/hadoop-0.16.4> telnet hermes.cse.sc.edu 9000
>> Trying 127.0.0.1...
>> Connected to hermes.cse.sc.edu (127.0.0.1).
>> Escape character is '^]'.
>> bye
>> Connection closed by foreign host.
>>
>> [EMAIL PROTECTED]:~/hadoop-0.16.4> host hermes.cse.sc.edu
>> hermes.cse.sc.edu has address 129.252.130.148
>>
>> [EMAIL PROTECTED]:~/hadoop-0.16.4> telnet 129.252.130.148 9000
>> Trying 129.252.130.148...
>> telnet: connect to address 129.252.130.148: Connection refused
>> telnet: Unable to connect to remote host: Connection refused
>>
>> So, the f

Re: Bean Scripting Framework?

2008-07-25 Thread Arun C Murthy


On Jul 25, 2008, at 3:53 PM, Joydeep Sen Sarma wrote:

Just as an aside - there is probably a general perception that streaming
is really slow (at least I had it).

The last time I did some profiling (in 0.15), the primary overheads from
streaming came from the scripting language (python is sloooow). For
an insanely fast script (bin/cat), I saw significant overheads in the Java
function/data path that drowned out the streaming overheads by a huge margin
(a lot of those overheads have been fixed in recent versions - thanks to
the hadoop team).

Writing a C/C++ streaming program is a pretty good way of getting good
performance (and some performance-sensitive apps in our environment
ended up doing just that).

Agreed that not all hooks are available.


Hadoop Pipes?

Arun




-Original Message-
From: James Moore [mailto:[EMAIL PROTECTED]
Sent: Friday, July 25, 2008 6:18 AM
To: core-user@hadoop.apache.org
Subject: Re: Bean Scripting Framework?

On Thu, Jul 24, 2008 at 10:48 PM, Venkat Seeth <[EMAIL PROTECTED]>  
wrote:

Why don't you use hadoop streaming?


I think that's really a broader question - why doesn't everyone use
streaming?

There's no real difference between doing Hadoop in
Ruby/Scala/Java/Jython/whatever - these days, Java is just one of many
languages that run on the JVM.  There's no language-specific reason
to pick streaming over a native implementation if you're working in a
language that has a JVM implementation.  I'm working on a Ruby
interface just because I think there's a space for a nice DSL for
setting up Hadoop and running tasks that's more pleasant for people
used to writing Ruby than the current idioms.

Streaming is great for things that don't run on a JVM - Erlang,
Haskell, Smalltalk, etc.

If you're streaming, though, you lose all the flexibility of Hadoop.
You get line-oriented text in and out, and that's about it.  But if
you want all the Hadoop features, you're going to want to go native,
be it in Ruby, Scala, Java, or whatever your language of choice is.

Streaming is powerful, and huge numbers of solutions of the form
"my_code < data > output" have solved many, many problems over the
years.  If your problem fits in the streaming space, then you should
consider it.   And I think that's a language-neutral statement - just
because your solution is in Java doesn't mean you should bother
hooking it up into a native Hadoop app.

--
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com




RE: Bean Scripting Framework?

2008-07-25 Thread Joydeep Sen Sarma
Just as an aside - there is probably a general perception that streaming
is really slow (at least I had it).

The last time I did some profiling (in 0.15), the primary overheads from
streaming came from the scripting language (python is sloooow). For
an insanely fast script (bin/cat), I saw significant overheads in the Java
function/data path that drowned out the streaming overheads by a huge margin
(a lot of those overheads have been fixed in recent versions - thanks to
the hadoop team).

Writing a C/C++ streaming program is a pretty good way of getting good
performance (and some performance-sensitive apps in our environment
ended up doing just that).

Agreed that not all hooks are available.


-Original Message-
From: James Moore [mailto:[EMAIL PROTECTED] 
Sent: Friday, July 25, 2008 6:18 AM
To: core-user@hadoop.apache.org
Subject: Re: Bean Scripting Framework?

On Thu, Jul 24, 2008 at 10:48 PM, Venkat Seeth <[EMAIL PROTECTED]> wrote:
> Why don't you use hadoop streaming?

I think that's really a broader question - why doesn't everyone use
streaming?

There's no real difference between doing Hadoop in
Ruby/Scala/Java/Jython/whatever - these days, Java is just one of many
languages that run on the JVM.  There's no language-specific reason
to pick streaming over a native implementation if you're working in a
language that has a JVM implementation.  I'm working on a Ruby
interface just because I think there's a space for a nice DSL for
setting up Hadoop and running tasks that's more pleasant for people
used to writing Ruby than the current idioms.

Streaming is great for things that don't run on a JVM - Erlang,
Haskell, Smalltalk, etc.

If you're streaming, though, you lose all the flexibility of Hadoop.
You get line-oriented text in and out, and that's about it.  But if
you want all the Hadoop features, you're going to want to go native,
be it in Ruby, Scala, Java, or whatever your language of choice is.

Streaming is powerful, and huge numbers of solutions of the form
"my_code < data > output" have solved many, many problems over the
years.  If your problem fits in the streaming space, then you should
consider it.   And I think that's a language-neutral statement - just
because your solution is in Java doesn't mean you should bother
hooking it up into a native Hadoop app.

-- 
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com


Re: Bean Scripting Framework?

2008-07-25 Thread Andreas Kostyrka
On Friday 25 July 2008 15:18:24 James Moore wrote:
> On Thu, Jul 24, 2008 at 10:48 PM, Venkat Seeth <[EMAIL PROTECTED]> wrote:
> > Why don't you use hadoop streaming?
>
> I think that's really a broader question - why doesn't everyone use
> streaming?
>
> There's no real difference between doing Hadoop in
> Ruby/Scala/Java/Jython/whatever - these days, Java is just one of many
> languages that run on the JVM.  There's no language-specific reason
> to pick streaming over a native implementation if you're working in a
> language that has a JVM implementation.  I'm working on a Ruby
> interface just because I think there's a space for a nice DSL for
> setting up Hadoop and running tasks that's more pleasant for people
> used to writing Ruby than the current idioms.

Well, there are reasons to go for streaming. It's an acceptable interface.

For many non-Java developers it's by far a nicer interface than trying to use
the Java APIs, which are clunky at best by scripting-language standards.

For comparison, I managed to write a tar implementation that reads/writes to
S3 in cpython/boto in less than a day. hdfstar, a Jython script, took me a
week. Some time was clearly spent debugging the Java/Jython setup and fixing
tarfile.py, but most of the time was spent dealing with the Java APIs to
read/write HDFS files.

Please note that I did not know S3/boto before either. And note that I can
read Java quite fine. (Actually, I can even write it quite fine, if forced by
an employer :-P )

So if streaming.jar is enough for a given use case, use it. As a side benefit
you get a program that you can use differently, without hadoop.

Andreas



>
> Streaming is great for things that don't run on a JVM - Erlang,
> Haskell, Smalltalk, etc.
>
> If you're streaming, though, you lose all the flexibility of Hadoop.
> You get line-oriented text in and out, and that's about it.  But if
> you want all the Hadoop features, you're going to want to go native,
> be it in Ruby, Scala, Java, or whatever your language of choice is.
>
> Streaming is powerful, and huge numbers of solutions of the form
> "my_code < data > output" have solved many, many problems over the
> years.  If your problem fits in the streaming space, then you should
> consider it.   And I think that's a language-neutral statement - just
> because your solution is in Java doesn't mean you should bother
> hooking it up into a native Hadoop app.






Re: Bean Scripting Framework?

2008-07-25 Thread Lincoln Ritter
This is a bit scattered but I wanted to post this in case it might
help someone...

Here's a little more detail on the loading problems I've been having.

For now, I'm just trying to call some ruby from the reduce method of
my map/reduce job.  I want to move to a more general setup, like the
one James Moore proposes above, but I'm taking baby steps due to my
general lack of knowledge regarding hadoop and jruby.

The first problem I encountered was that, from within Hadoop, I was
unable to load the scripting framework (JSR 223) at all. I was getting
this exception (using JRubyScriptEngineManager):

Exception in thread "main" java.lang.NullPointerException
at org.jruby.runtime.load.LoadService.findFile(LoadService.java:476)
at org.jruby.runtime.load.LoadService.findLibrary(LoadService.java:394)
at org.jruby.runtime.load.LoadService.smartLoad(LoadService.java:259)
at org.jruby.runtime.load.LoadService.require(LoadService.java:349)
at com.sun.script.jruby.JRubyScriptEngine.init(JRubyScriptEngine.java:484)
at com.sun.script.jruby.JRubyScriptEngine.<init>(JRubyScriptEngine.java:96)
at com.sun.script.jruby.JRubyScriptEngineFactory.getScriptEngine(JRubyScriptEngineFactory.java:134)
at com.sun.script.jruby.JRubyScriptEngineManager.registerEngineNames(JRubyScriptEngineManager.java:95)
at com.sun.script.jruby.JRubyScriptEngineManager.init(JRubyScriptEngineManager.java:72)
at com.sun.script.jruby.JRubyScriptEngineManager.<init>(JRubyScriptEngineManager.java:66)
at com.sun.script.jruby.JRubyScriptEngineManager.<init>(JRubyScriptEngineManager.java:61)
at com.talentspring.TestMapreduce.dump(TestMapreduce.java:236)
at com.talentspring.TestMapreduce.main(TestMapreduce.java:432)

Poking around the JRubyScriptEngine source
(https://scripting.dev.java.net/source/browse/scripting/engines/jruby/src/com/sun/script/jruby/)
it looks like it uses the property "com.sun.script.jruby.loadpath" and
not "jruby.home" as suggested by
http://wiki.jruby.org/wiki/Java_Integration#Java_6_.28using_JSR_223:_Scripting.29
.  hmmm.

I added -Dcom.sun.script.jruby.loadpath=$JRUBY_HOME to my invocation
and it worked... sort of.  I found that by the time execution reached
the 'configure' method, the load path property was null.   Odd.  Does
anybody know why this might be?  In any case, I saved the value in my
JobConf before submitting the job, like so:

jobConf.set("jruby.load_path",
System.getProperty("com.sun.script.jruby.loadpath"));

Then, in the configure method I have:

System.setProperty("com.sun.script.jruby.loadpath",
jobConf.get("jruby.load_path"));

I then load the script engine and everything works...


So: does anybody have any idea why I might be losing the system
load path property by the time I get to the configure method?
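A plausible explanation: map and reduce tasks run in child JVMs spawned by
the TaskTracker, so a system property set with -D on the submitting JVM never
reaches the task side. A minimal sketch of the save/restore pattern above,
assuming the 0.16-era org.apache.hadoop.mapred API (the mapper class name is
illustrative; the "jruby.load_path" key mirrors the snippets above):

import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class RubyMapper extends MapReduceBase implements Mapper {

  // configure() runs once per task, inside the child JVM, so this is
  // the first chance to put the property back where the JRuby engine
  // expects it. "jruby.load_path" is the key stashed at submit time via:
  //   jobConf.set("jruby.load_path",
  //               System.getProperty("com.sun.script.jruby.loadpath"));
  public void configure(JobConf job) {
    System.setProperty("com.sun.script.jruby.loadpath",
                       job.get("jruby.load_path"));
    // ...now it should be safe to create JRubyScriptEngineManager...
  }

  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter)
      throws IOException {
    // ...hand key/value to the Ruby code through the script engine...
  }
}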

Cheers,
-lincoln

--
lincolnritter.com



On Fri, Jul 25, 2008 at 10:22 AM, Lincoln Ritter
<[EMAIL PROTECTED]> wrote:
> I was using BSF to avoid Java 6 issues.  However I'm having similar
> issues using both systems.  Basically, I can't load the scripting
> engine from within hadoop.  I have successfully compiled and run some
> stand-alone test examples but am having trouble getting anything to
> work from hadoop.  One confounding factor is that my development
> machine is OS X 10.5 with the stock 1.5 JDK.  On the surface this
> doesn't seem to be a problem given the success I've had at creating
> small stand-alone tests...  I run the stand-alone stuff with exactly
> the same classpath and environment so it seems that something weird is
> going on.  Additionally, as a sanity check, I've tried loading the
> javascript engine and that does work from within hadoop.
>
> All the JSR jars are on the classpath and I'm kicking off the hadoop
> process using the -Djruby.home=... option.  Did you have to do
> anything special here?
>
> -lincoln
>
> --
> lincolnritter.com
>
>
>
> On Thu, Jul 24, 2008 at 7:00 PM, James Moore <[EMAIL PROTECTED]> wrote:
>> On Thu, Jul 24, 2008 at 3:51 PM, Lincoln Ritter
>> <[EMAIL PROTECTED]> wrote:
>>> Well that sounds awesome!  It would be simply splendid to see what
>>> you've got if you're willing to share.
>>
>> I'll be happy to share, but it's pretty much in pieces, not ready for
>> release.  I'll put it out with whatever license Hadoop itself uses
>> (presumably Apache).
>>
>>>
>>> Are you going the 'direct' embedding route or using a scripting
>>> framework (BSF or javax.script)?
>>
>> JSR 223 is the way to go according to the JRuby guys at RailsConf last
>> month.  It's pretty straightforward - see
>> http://wiki.jruby.org/wiki/Java_Integration#Java_6_.28using_JSR_223:_Scripting.29
>>
>> --
>> James Moore | [EMAIL PROTECTED]
>> Ruby and Ruby on Rails consulting
>> blog.restphone.com
>>
>


Is there a network communication counter for mapred?

2008-07-25 Thread Kevin
Hi,

Besides knowing the "data-local" and "rack-local" map task numbers, I am
interested in the size of the data transferred over the network - e.g.,
the size of intermediate map output fetched remotely (not handled locally).
I wonder if there is such a counter. Thank you.

Best,
-Kevin


Re: Bean Scripting Framework?

2008-07-25 Thread Lincoln Ritter
I was using BSF to avoid Java 6 issues.  However I'm having similar
issues using both systems.  Basically, I can't load the scripting
engine from within hadoop.  I have successfully compiled and run some
stand-alone test examples but am having trouble getting anything to
work from hadoop.  One confounding factor is that my development
machine is OS X 10.5 with the stock 1.5 JDK.  On the surface this
doesn't seem to be a problem given the success I've had at creating
small stand-alone tests...  I run the stand-alone stuff with exactly
the same classpath and environment so it seems that something weird is
going on.  Additionally, as a sanity check, I've tried loading the
javascript engine and that does work from within hadoop.

All the JSR jars are on the classpath and I'm kicking off the hadoop
process using the -Djruby.home=... option.  Did you have to do
anything special here?

-lincoln

--
lincolnritter.com



On Thu, Jul 24, 2008 at 7:00 PM, James Moore <[EMAIL PROTECTED]> wrote:
> On Thu, Jul 24, 2008 at 3:51 PM, Lincoln Ritter
> <[EMAIL PROTECTED]> wrote:
>> Well that sounds awesome!  It would be simply splendid to see what
>> you've got if you're willing to share.
>
> I'll be happy to share, but it's pretty much in pieces, not ready for
> release.  I'll put it out with whatever license Hadoop itself uses
> (presumably Apache).
>
>>
>> Are you going the 'direct' embedding route or using a scripting
>> framework (BSF or javax.script)?
>
> JSR 223 is the way to go according to the JRuby guys at RailsConf last
> month.  It's pretty straightforward - see
> http://wiki.jruby.org/wiki/Java_Integration#Java_6_.28using_JSR_223:_Scripting.29
>
> --
> James Moore | [EMAIL PROTECTED]
> Ruby and Ruby on Rails consulting
> blog.restphone.com
>


Re: Using MapReduce to do table comparing.

2008-07-25 Thread James Moore
On Thu, Jul 24, 2008 at 8:03 AM, Amber <[EMAIL PROTECTED]> wrote:
> Yes, I think this is the simplest method , but there are problems too:
>
> 1. The reduce stage wouldn't begin until the map stage ends, by which time
> we have already scanned both tables, and the comparing will take almost the
> same time, because about 90% of the intermediate <key, value> pairs will have
> two values and different keys. If I could specify a number n such that the
> reduce tasks start once there are n intermediate pairs with the same key,
> that would be better. In my case I would set the magic number to 2.

I don't think I understood this completely, but I'll try to respond.

First, I think you're going to be doing something like two full table
scans in any case.  Whether it's in an RDBMS or in Hadoop, you need to
read the complete dataset for both day1 and day2.  (Or at least that's
how I interpreted your original mail - you're not trying to keep
deltas over N days, just doing a delta for yesterday/today from
scratch every time)  You could possibly speed this up by keeping some
kind of parsed data in hadoop for previous days, rather than just
text, but I wouldn't do this as my first solution.

It seems like starting the reducers before the maps are done isn't
going to buy you anything.  The same amount of total work needs to be
done; when the work starts doesn't matter much.  In this case, I'm
guessing that you're going to have a setup where (total number of
maps) == (total number of reducers) == 4 * (number of 4-core
machines).

In any case, I'd say you should do some experiments with the most
simple solution you can come up with.  Your problem seems simple
enough that just banging out some throwaway experimental code is going
to a) not take very long, and b) tell you quite a bit about how your
particular solution is going to behave in the real world.

>
> 2. I am not sure about how Hadoop stores intermediate <key, value> pairs; we
> could not afford it, as data volume increases, if they are kept in memory.

Hadoop is definitely prepared for very large numbers of intermediate
key/value pairs - that's pretty much the normal case for hadoop jobs.
It'll stream to/from disc as necessary.  Take a look at combiners as
well - they may buy you something.
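A minimal sketch of the two-scan diff job being discussed, assuming the
0.16-era mapred API, input lines of the form rowKey<TAB>columns, the day1
and day2 dumps in separate input files, and tagging by source file via the
map.input.file property - all illustrative choices, not a definitive
implementation:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class TableDiff {

  public static class DiffMapper extends MapReduceBase implements Mapper {
    private String tag;                     // which dump this task reads

    public void configure(JobConf job) {
      // map.input.file names the file behind this task's input split.
      tag = job.get("map.input.file", "unknown");
    }

    public void map(WritableComparable key, Writable value,
                    OutputCollector output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      int tab = line.indexOf('\t');
      if (tab < 0) return;                  // skip malformed lines
      // Key by row id, tag the row body with its source file so the
      // reducer can tell the day1 copy from the day2 copy.
      output.collect(new Text(line.substring(0, tab)),
                     new Text(tag + "\t" + line.substring(tab + 1)));
    }
  }

  public static class DiffReducer extends MapReduceBase implements Reducer {
    public void reduce(WritableComparable key, Iterator values,
                       OutputCollector output, Reporter reporter)
        throws IOException {
      String first = null, second = null;
      while (values.hasNext()) {
        String v = values.next().toString();
        if (first == null) { first = v; } else { second = v; }
      }
      if (second == null) {
        // Row exists in only one dump; report which one by its tag.
        output.collect(key, new Text("only-in-" + first.split("\t", 2)[0]));
      } else if (!first.split("\t", 2)[1].equals(second.split("\t", 2)[1])) {
        output.collect(key, new Text("changed"));
      }
      // Identical rows produce no output, so the result is exactly the delta.
    }
  }
}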

-- 
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com


Re: Hadoop DFS

2008-07-25 Thread hong

HBase is a project that uses DFS.

If you want to know how to use DFS directly, the "bin/hadoop" script may
be a good entry point.

For example:
"bin/hadoop dfs -cat ***"
where *** is a file name in your DFS.

Following this command, you can see how to access DFS directly.
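For programmatic access, a minimal sketch that reads a DFS file through
the Java API (assuming the 0.16-era API and a classpath that includes
your hadoop-site.xml; the class name is illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DfsCat {
  public static void main(String[] args) throws Exception {
    // Reads fs.default.name from hadoop-site.xml, so this connects to
    // the same DFS the command-line tools use.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Open the DFS file named on the command line and print it.
    FSDataInputStream in = fs.open(new Path(args[0]));
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    for (String line; (line = reader.readLine()) != null; ) {
      System.out.println(line);
    }
    reader.close();
  }
}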

Hope it will help you

On 2008-7-25, at 4:09 AM, Wasim Bari wrote:


Hi,
I am new to Hadoop. Right now, I am only interested in working
with Hadoop DFS. Can someone guide me where to start? Does anyone have
information about an application that has already integrated Hadoop DFS?


Any material about Hadoop DFS (case studies, articles, books, etc.) would
be very nice.


Thanks,

Wasim





Re: Bean Scripting Framework?

2008-07-25 Thread James Moore
On Thu, Jul 24, 2008 at 10:48 PM, Venkat Seeth <[EMAIL PROTECTED]> wrote:
> Why don't you use hadoop streaming?

I think that's really a broader question - why doesn't everyone use
streaming?

There's no real difference between doing Hadoop in
Ruby/Scala/Java/Jython/whatever - these days, Java is just one of many
languages that run on the JVM.  There's no language-specific reason
to pick streaming over a native implementation if you're working in a
language that has a JVM implementation.  I'm working on a Ruby
interface just because I think there's a space for a nice DSL for
setting up Hadoop and running tasks that's more pleasant for people
used to writing Ruby than the current idioms.

Streaming is great for things that don't run on a JVM - Erlang,
Haskell, Smalltalk, etc.

If you're streaming, though, you lose all the flexibility of Hadoop.
You get line-oriented text in and out, and that's about it.  But if
you want all the Hadoop features, you're going to want to go native,
be it in Ruby, Scala, Java, or whatever your language of choice is.

Streaming is powerful, and huge numbers of solutions of the form
"my_code < data > output" have solved many, many problems over the
years.  If your problem fits in the streaming space, then you should
consider it.   And I think that's a language-neutral statement - just
because your solution is in Java doesn't mean you should bother
hooking it up into a native Hadoop app.
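
To make the streaming case concrete, a sketch of a typical invocation
(the streaming jar's path varies by version and install layout; the input
and output paths and the cat/wc pair are just placeholders):

bin/hadoop jar contrib/streaming/hadoop-*-streaming.jar \
    -input /user/me/input \
    -output /user/me/output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc

Each mapper gets lines of text on stdin and emits "key<TAB>value" lines
on stdout; the reducer sees the sorted lines the same way.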

-- 
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com


Re: Bean Scripting Framework?

2008-07-25 Thread Torsten Curdt

That sounds really interesting.

On Jul 25, 2008, at 00:42, James Moore wrote:


Funny you should mention it - I'm working on a framework to do JRuby
Hadoop this week.  Something like:

class MyHadoopJob < Radoop
 input_format :text_input_format
 output_format :text_output_format
 map_output_key_class :text
 map_output_value_class :text

 def mapper(k, v, output, reporter)
   # ...
 end

 def reducer(k, vs, output, reporter)
 end
end

Plus a java glue file to call the Ruby stuff.

And then it jars up the ruby files, the gem directory, and goes from there.


--
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com