Re: How to write one file per key as mapreduce output

2008-07-29 Thread Lincoln Ritter
Thanks for the info!

> Not sure what happens if you write NULL as key or value.

Looking at the code, it doesn't seem to really make a difference, and
the function in question (basically 'collect') looks to be robust to
null - but I may be missing something!

In my case, I basically want the key to be the output filename, and
the data in the files to be directly consumable by my app.  Having the
key show up in the file complicates things on the app side so I'm
trying to avoid this.  Passing null seems to work for now.
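For reference, the call ends up looking something like this (a sketch;
'mos' and 'sha' come from my test code in the earlier messages below, and
the NullWritable variant is only equivalent if your output format treats
it the same way as a null key):

// Null key: TextOutputFormat then writes just the value, which is what I want.
mos.getCollector("text", sha, reporter).collect(null, new Text(data.toString()));

// Possibly a more explicit alternative, but only if the output format in
// your version treats NullWritable like a null key:
// mos.getCollector("text", sha, reporter)
//     .collect(NullWritable.get(), new Text(data.toString()));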


-lincoln

--
lincolnritter.com




On Tue, Jul 29, 2008 at 9:27 AM, Alejandro Abdelnur <[EMAIL PROTECTED]> wrote:
> On Thu, Jul 24, 2008 at 12:32 AM, Lincoln Ritter
> <[EMAIL PROTECTED]> wrote:
>
>> Alejandro said:
>>> Take a look at the MultipleOutputFormat class or MultipleOutputs (in SVN 
>>> tip)
>>
>> I'm muddling through both
>> http://issues.apache.org/jira/browse/HADOOP-2906 and
>> http://issues.apache.org/jira/browse/HADOOP-3149 trying to make sense
>> of these.  I'm a little confused by the way this works, but it looks
>> like I can define a number of named outputs, each of which can have
>> its own output format, and I can also define some of these
>> as "multi", meaning that I can write to different "targets" (like
>> files).  Is this correct?
>
> Exactly.
>
> 
>
>> A couple of questions:
>>
>>  - I needed to pass 'null' to the collect method so as to not write
>> the key to the file.  These files are meant to be consumable chunks of
>> content so I want to control exactly what goes into them.  Does this
>> seem normal or am i missing something?  Is there a downside to passing
>> null here?
>
> Not sure what happens if you write NULL as key or value.
>
>>  - What is the 'part-0' file for?  I have seen this in other
>> places in the dfs. But it seems extraneous here.  It's not super
>> critical but if I can make it go away that would be great.
>
> This is the standard output of the M/R job: whatever is written to the
> OutputCollector you get in the reduce() call (or in the map() call
> when the number of reduces is 0).
>
>>  - What is the purpose of the '-r-0' suffix?  Perhaps it is to
>> help with collisions?
>
> Yes, files written from a map have '-m-', files written from a reduce have 
> '-r-'
>
>> I guess it seems strange that I can't just say
>> "the output file should be called X" and have an output file called X
>> appear.
>
> Well, you need the map/reduce mask and the task number mask to avoid
> collisions.
>


Re: Bean Scripting Framework?

2008-07-25 Thread Lincoln Ritter
This is a bit scattered but I wanted to post this in case it might
help someone...

Here's a little more detail on the loading problems I've been having.

For now, I'm just trying to call some ruby from the reduce method of
my map/reduce job.  I want to move to a more general setup, like the
one James Moore proposes above, but I'm taking baby steps due to my
general lack of knowledge regarding hadoop and jruby.

The first problem I encountered was that, from within hadoop, I was
unable to load the scripting framework (JSR 223) at all.  I was getting
this exception (using JRubyScriptEngineManager):

Exception in thread "main" java.lang.NullPointerException
at org.jruby.runtime.load.LoadService.findFile(LoadService.java:476)
at org.jruby.runtime.load.LoadService.findLibrary(LoadService.java:394)
at org.jruby.runtime.load.LoadService.smartLoad(LoadService.java:259)
at org.jruby.runtime.load.LoadService.require(LoadService.java:349)
at com.sun.script.jruby.JRubyScriptEngine.init(JRubyScriptEngine.java:484)
at com.sun.script.jruby.JRubyScriptEngine.<init>(JRubyScriptEngine.java:96)
at com.sun.script.jruby.JRubyScriptEngineFactory.getScriptEngine(JRubyScriptEngineFactory.java:134)
at com.sun.script.jruby.JRubyScriptEngineManager.registerEngineNames(JRubyScriptEngineManager.java:95)
at com.sun.script.jruby.JRubyScriptEngineManager.init(JRubyScriptEngineManager.java:72)
at com.sun.script.jruby.JRubyScriptEngineManager.<init>(JRubyScriptEngineManager.java:66)
at com.sun.script.jruby.JRubyScriptEngineManager.<init>(JRubyScriptEngineManager.java:61)
at com.talentspring.TestMapreduce.dump(TestMapreduce.java:236)
at com.talentspring.TestMapreduce.main(TestMapreduce.java:432)

Poking around the JRubyScriptEngine source
(https://scripting.dev.java.net/source/browse/scripting/engines/jruby/src/com/sun/script/jruby/)
it looks like it uses the property "com.sun.script.jruby.loadpath" and
not "jruby.home" as suggested by
http://wiki.jruby.org/wiki/Java_Integration#Java_6_.28using_JSR_223:_Scripting.29
Hmmm.

I added -Dcom.sun.script.jruby.loadpath=$JRUBY_HOME to my invocation
and it worked... sort of.  I found that by the time execution reached
the 'configure' method, the load path property was null.   Odd.  Does
anybody know why this might be?  In any case, I saved the value in my
JobConf before submitting the job, like so:

jobConf.set("jruby.load_path",
System.getProperty("com.sun.script.jruby.loadpath"));

Then, in the configure method I have:

System.setProperty("com.sun.script.jruby.loadpath",
jobConf.get("jruby.load_path"));

I then load the script engine and everything works...
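Putting the pieces together, the task-side part looks roughly like this
(a sketch, not my exact class; I'm going through the plain javax.script
API here rather than JRubyScriptEngineManager, and assuming the JSR 223
API jars are on the task classpath):

// Driver side, before submitting the job ("jruby.load_path" is just the
// key I picked above):
//   jobConf.set("jruby.load_path",
//       System.getProperty("com.sun.script.jruby.loadpath"));

private javax.script.ScriptEngine ruby;

public void configure(JobConf conf) {
  // Restore the system property the JRuby JSR 223 engine reads for its load
  // path, since it shows up as null by the time configure() runs in the task.
  System.setProperty("com.sun.script.jruby.loadpath", conf.get("jruby.load_path"));

  // Look the engine up through the standard javax.script API.
  ruby = new javax.script.ScriptEngineManager().getEngineByName("jruby");
  try {
    ruby.eval("puts 'jruby is alive inside the task'");
  } catch (javax.script.ScriptException e) {
    throw new RuntimeException(e);
  }
}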


So: does anybody have any idea why I might be losing the system
load path property when I get to the configure method?

Cheers,
-lincoln

--
lincolnritter.com



On Fri, Jul 25, 2008 at 10:22 AM, Lincoln Ritter
<[EMAIL PROTECTED]> wrote:
> I was using BSF to avoid java 6 issues.  However I'm having similar
> issues using both systems.  Basically, I can't load the scripting
> engine from within hadoop.  I have successfully compiled and run some
> stand-alone test examples but am having trouble getting anything to
> work from hadoop.  One confounding factor is that my development
> machine is OS X 10.5 with the stock 1.5 JDK.  On the surface this
> doesn't seem to be a problem given the success I've had at creating
> small stand-alone tests...  I run the stand-alone stuff with exactly
> the same classpath and environment so it seems that something weird is
> going on.  Additionally, as a sanity check, I've tried loading the
> javascript engine and that does work from within hadoop.
>
> All the JSR jars are on the classpath and I'm kicking off the hadoop
> process using the -Djruby.home=... option.  Did you have to do
> anything special here?
>
> -lincoln
>
> --
> lincolnritter.com
>
>
>
> On Thu, Jul 24, 2008 at 7:00 PM, James Moore <[EMAIL PROTECTED]> wrote:
>> On Thu, Jul 24, 2008 at 3:51 PM, Lincoln Ritter
>> <[EMAIL PROTECTED]> wrote:
>>> Well that sounds awesome!  It would be simply splendid to see what
>>> you've got if you're willing to share.
>>
>> I'll be happy to share, but it's pretty much in pieces, not ready for
>> release.  I'll put it out with whatever license Hadoop itself uses
>> (presumably Apache).
>>
>>>
>>> Are you going the 'direct' embedding route or using a scripting
>>> framework (BSF or javax.script)?
>>
>> JSR 223 is the way to go according to the JRuby guys at RailsConf last
>> month.  It's pretty straightforward - see
>> http://wiki.jruby.org/wiki/Java_Integration#Java_6_.28using_JSR_223:_Scripting.29
>>
>> --
>> James Moore | [EMAIL PROTECTED]
>> Ruby and Ruby on Rails consulting
>> blog.restphone.com
>>
>


Re: Bean Scripting Framework?

2008-07-25 Thread Lincoln Ritter
I was using BSF to avoid java 6 issues.  However I'm having similar
issues using both systems.  Basically, I can't load the scripting
engine from within hadoop.  I have successfully compiled and run some
stand-alone test examples but am having trouble getting anything to
work from hadoop.  One confounding factor is that my development
machine is OS X 10.5 with the stock 1.5 JDK.  On the surface this
doesn't seem to be a problem given the success I've had at creating
small stand-alone tests...  I run the stand-alone stuff with exactly
the same classpath and environment so it seems that something weird is
going on.  Additionally, as a sanity check, I've tried loading the
javascript engine and that does work from within hadoop.

All the JSR jars are on the classpath and I'm kicking off the hadoop
process using the -Djruby.home=... option.  Did you have to do
anything special here?

-lincoln

--
lincolnritter.com



On Thu, Jul 24, 2008 at 7:00 PM, James Moore <[EMAIL PROTECTED]> wrote:
> On Thu, Jul 24, 2008 at 3:51 PM, Lincoln Ritter
> <[EMAIL PROTECTED]> wrote:
>> Well that sounds awesome!  It would be simply splendid to see what
>> you've got if you're willing to share.
>
> I'll be happy to share, but it's pretty much in pieces, not ready for
> release.  I'll put it out with whatever license Hadoop itself uses
> (presumably Apache).
>
>>
>> Are you going the 'direct' embedding route or using a scripting
>> framework (BSF or javax.script)?
>
> JSR 223 is the way to go according to the JRuby guys at RailsConf last
> month.  It's pretty straightforward - see
> http://wiki.jruby.org/wiki/Java_Integration#Java_6_.28using_JSR_223:_Scripting.29
>
> --
> James Moore | [EMAIL PROTECTED]
> Ruby and Ruby on Rails consulting
> blog.restphone.com
>


Re: Bean Scripting Framework?

2008-07-24 Thread Lincoln Ritter
Well that sounds awesome!  It would be simply splendid to see what
you've got if you're willing to share.

Are you going the 'direct' embedding route or using a scripting
framework (BSF or javax.script)?

-lincoln

--
lincolnritter.com



On Thu, Jul 24, 2008 at 3:42 PM, James Moore <[EMAIL PROTECTED]> wrote:
> Funny you should mention it - I'm working on a framework to do JRuby
> Hadoop this week.  Something like:
>
> class MyHadoopJob < Radoop
>  input_format :text_input_format
>  output_format :text_output_format
>  map_output_key_class :text
>  map_output_value_class :text
>
>  def mapper(k, v, output, reporter)
># ...
>  end
>
>  def reducer(k, vs, output, reporter)
>  end
> end
>
> Plus a java glue file to call the Ruby stuff.
>
> And then it jars up the ruby files, the gem directory, and goes from there.
>
> --
> James Moore | [EMAIL PROTECTED]
> Ruby and Ruby on Rails consulting
> blog.restphone.com
>


Re: Bean Scripting Framework?

2008-07-24 Thread Lincoln Ritter
Andreas,

If you wouldn't mind posting some snippets that would be great!  There
seems to be a general lack of examples out there so pretty much
anything would help.

-lincoln

--
lincolnritter.com



On Thu, Jul 24, 2008 at 3:06 PM, Andreas Kostyrka <[EMAIL PROTECTED]> wrote:
> On Thursday 24 July 2008 23:24:19 Lincoln Ritter wrote:
>> > Why not use jruby?
>>
>> Indeed!  I'm basically working from the JRuby wiki page on Java
>> integration (http://wiki.jruby.org/wiki/Java_Integration).  I'm taking
>> this one step at a time and, while I would love tighter integration,
>> the recommended way is through the scripting frameworks.
>>
>> Right now, I'm most interested in taking some baby steps before going
>> more general.  I welcome any and all feedback/suggestions.  Especially
>> if you have tried this.  I will post any results if there is interest,
>> but mostly I am trying to accomplish a pretty small task and am not
>> yet thinking about a more general solution.
>
> Guess I won't be a big resource for you then, the only thing that I did was
> implementing a tar program with Jython that creates/extracts from/to HDFS.
>
> It was painful, but not too painful, and it's not Jython's fault, it's just that
> using these clunky interfaces/classes is painful to a Python developer. Guess
> the same feeling will come from Ruby developers.
>
> (and that's not a problem of Hadoop, I think that most Java APIs feel clunky
> to people used to more powerful languages. :-P)
>
> Andreas
>


Re: Bean Scripting Framework?

2008-07-24 Thread Lincoln Ritter
> Why not use jruby?

Indeed!  I'm basically working from the JRuby wiki page on Java
integration (http://wiki.jruby.org/wiki/Java_Integration).  I'm taking
this one step at a time and, while I would love tighter integration,
the recommended way is through the scripting frameworks.

Right now, I'm most interested in taking some baby steps before going
more general.  I welcome any and all feedback/suggestions.  Especially
if you have tried this.  I will post any results if there is interest,
but mostly I am trying to accomplish a pretty small task and am not
yet thinking about a more general solution.

-lincoln

--
lincolnritter.com



On Thu, Jul 24, 2008 at 1:58 PM, Andreas Kostyrka <[EMAIL PROTECTED]> wrote:
> On Thursday 24 July 2008 21:40:20 Lincoln Ritter wrote:
>> Hello all.
>>
>> Has anybody ever tried/considered using the Bean Scripting Framework
>> within Hadoop?  BSF seems nice since it allows "two-way" communication
>> between ruby and java.  I'd love to hear your thoughts as I've been
>> trying to make this work to allow using ruby in the m/r pipeline.  For
>> now, I don't need a fully general solution, I'd just like to call some
>> ruby in my map or reduce tasks.
>
> Why not use jruby? AFAIK, there is a complete ruby implementation on top of
> Java, and although I have not used it, I'd presume that it allows full usage
> of Java classes, as Jython does.
>
> Andreas
>


Bean Scripting Framework?

2008-07-24 Thread Lincoln Ritter
Hello all.

Has anybody ever tried/considered using the Bean Scripting Framework
within Hadoop?  BSF seems nice since it allows "two-way" communication
between ruby and java.  I'd love to hear your thoughts as I've been
trying to make this work to allow using ruby in the m/r pipeline.  For
now, I don't need a fully general solution, I'd just like to call some
ruby in my map or reduce tasks.
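To make it concrete, what I'm imagining is something along these lines
(untested sketch; I believe JRuby's BSF engine class is
org.jruby.javasupport.bsf.JRubyEngine, but treat that name as an
assumption on my part):

import org.apache.bsf.BSFException;
import org.apache.bsf.BSFManager;

public class RubyFromJava {
  public static void main(String[] args) throws BSFException {
    // Register JRuby's BSF engine, then evaluate a small Ruby expression.
    BSFManager.registerScriptingEngine(
        "ruby", "org.jruby.javasupport.bsf.JRubyEngine", new String[] { "rb" });
    BSFManager manager = new BSFManager();
    Object result = manager.eval(
        "ruby", "(java)", 1, 1, "[1, 2, 3].map { |x| x * 2 }.inspect");
    System.out.println(result);  // should print "[2, 4, 6]"
  }
}

The idea would then be to do the registration in configure() and the
eval() calls inside map()/reduce().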

Thanks!

-lincoln

--
lincolnritter.com


Re: How to write one file per key as mapreduce output

2008-07-23 Thread Lincoln Ritter
Thanks for the responses!

James said:
> do you know the maximum number of keys?

No.  I suppose I could compute the number of keys in a separate pass
but that seems pretty icky.

Jason said:
> Where fs is a FileSystem object available via the getFileSystem(conf) method 
> of Path.
>  FSDataOutputStream out = fs.create( destinationFile );
> then write to your out as normal then close it at the end of your reduce body.

This seems very straightforward, but also seems to work outside of the
typical M/R framework; the files created are essentially side effects
and not the "actual" output of the job.  This doesn't seem very clean
to me, but perhaps this is my somewhat shaky understanding of the
paradigm showing through.
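(For the record, my reading of Jason's suggestion amounts to roughly the
following inside reduce(); 'conf' would be a JobConf saved in configure(),
and the output path is a placeholder:)

// Side-effect approach: write one file per key straight to the FileSystem,
// bypassing the job's OutputCollector entirely.  The path is a placeholder.
Path destinationFile = new Path("/per-key-output/" + key.toString());
FileSystem fs = destinationFile.getFileSystem(conf);
FSDataOutputStream out = fs.create(destinationFile);
try {
  out.writeBytes(data.toString());  // whatever bytes the app should consume
} finally {
  out.close();
}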

Alejandro said:
> Take a look at the MultipleOutputFormat class or MultipleOutputs (in SVN tip)

I'm muddling through both
http://issues.apache.org/jira/browse/HADOOP-2906 and
http://issues.apache.org/jira/browse/HADOOP-3149 trying to make sense
of these.  I'm a little confused by the way this works, but it looks
like I can define a number of named outputs, each of which can have
its own output format, and I can also define some of these
as "multi", meaning that I can write to different "targets" (like
files).  Is this correct?
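(As an aside, my reading of the plain MultipleOutputFormat route, via its
MultipleTextOutputFormat subclass, would be something like the sketch
below; the overridden method names are my interpretation from skimming
the class, so take them with a grain of salt.)

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Route each record to a file named after its key, and drop the key from
// the file contents so only the value is written.
public class KeyNamedTextOutputFormat extends MultipleTextOutputFormat<Text, Text> {

  @Override
  protected String generateFileNameForKeyValue(Text key, Text value, String name) {
    // Use the key itself as the file name instead of the default part-style name.
    return key.toString();
  }

  @Override
  protected Text generateActualKey(Text key, Text value) {
    // Returning null suppresses the key in the output line.
    return null;
  }
}

You would then (I think) point job.setOutputFormat(...) at that class
instead of TextOutputFormat.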

My current test looks like this (Note that I am very new to this so if
I am doing something dumb, please point it out so I can learn):

setup:

job.addInputPath(new Path(segment, Content.DIR_NAME));

job.setInputFormat(SequenceFileInputFormat.class);
job.setMapperClass(InputCompatMapper.class);
job.setReducerClass(TestMapreduce.class);

job.setOutputPath(output);
job.setOutputFormat(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NutchWritable.class);

MultipleOutputs.addMultiNamedOutput(job, "text",
TextOutputFormat.class, Text.class, Text.class);

reduce:

public void reduce(WritableComparable key, Iterator<NutchWritable> values,
                   OutputCollector<Text, NutchWritable> output,
                   Reporter reporter) throws IOException {
  ...
  mos.getCollector("text", sha, reporter).collect(null, new Text(data.toString()));
}

(mos is a MultipleOutputs set in configure(...), and sha is a String)
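(For completeness, the configure/close plumbing for 'mos' is roughly the
following; this is a sketch of the usual MultipleOutputs usage pattern
rather than a verbatim paste of my class:)

private MultipleOutputs mos;

public void configure(JobConf conf) {
  // Build the MultipleOutputs helper from the job configuration; the named
  // output "text" was registered with addMultiNamedOutput() in the setup above.
  mos = new MultipleOutputs(conf);
}

public void close() throws IOException {
  // MultipleOutputs keeps its own record writers open; close them when the task ends.
  mos.close();
}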

This seems to have mostly the desired effect, populating my output
directory with files named like
'text_0fe41fb5598a86b6b9f9a7181722a20cba6-r-0' as well as an empty
'part-0' file.

A couple of questions:

 - I needed to pass 'null' to the collect method so as to not write
the key to the file.  These files are meant to be consumable chunks of
content so I want to control exactly what goes into them.  Does this
seem normal or am i missing something?  Is there a downside to passing
null here?

 - What is the 'part-0' file for?  I have seen this in other
places in the dfs. But it seems extraneous here.  It's not super
critical but if I can make it go away that would be great.

 - What is the purpose of the '-r-0' suffix?  Perhaps it is to
help with collisions?  I guess it seems strange that I can't just say
"the output file should be called X" and have an output file called X
appear. I certainly want this process to be as robust as possible, but
I also would like to be able to make this as clean as possible.  If,
say, I can run this job and have it output a bunch of per-key
files directly to an S3-native fs, that would be swell, though certainly
I can make this happen in a multi-step process.  Anybody have more
info on this or other ideas?

Thanks so much!  This community is really great and helpful!

-lincoln

--
lincolnritter.com



On Wed, Jul 23, 2008 at 9:07 AM, James Moore <[EMAIL PROTECTED]> wrote:
> On Tue, Jul 22, 2008 at 5:04 PM, Lincoln Ritter
> <[EMAIL PROTECTED]> wrote:
>> Greetings,
>>
>> I would like to write one file per key in the reduce (or map) phase of a
>> mapreduce job.  I have looked at the documentation for
>> FileOutputFormat and MultipleTextOutputFormat but am a bit unclear on
>> how to use it/them.  Can anybody give me a quick pointer?
>
> One way to cheat for the reduce part of this - do you know the maximum
> number of keys?  If so, I think you should be able to just set the
> number of reducers to >= the maximum number of keys.
>
> --
> James Moore | [EMAIL PROTECTED]
> Ruby and Ruby on Rails consulting
> blog.restphone.com
>


How to write one file per key as mapreduce output

2008-07-22 Thread Lincoln Ritter
Greetings,

I have what I think is a pretty straight-forward, noobie question.  I
would like to write one file per key in the reduce (or map) phase of a
mapreduce job.  I have looked at the documentation for
FileOutputFormat and MultipleTextOutputFormat but am a bit unclear on
how to use it/them.  Can anybody give me a quick pointer?

Thanks very much!

-lincoln

--
lincolnritter.com


Re: Namenode Exceptions with S3

2008-07-11 Thread Lincoln Ritter
Thanks Tom!

Your explanation makes things a lot clearer.  I think that changing
the 'fs.default.name' to something like 'dfs.namenode.address' would
certainly be less confusing since it would clarify the purpose of
these values.
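For the archives, my understanding of that per-job setup is roughly the
following (untested sketch; the bucket name, paths and driver class are
placeholders, and HDFS stays the default filesystem):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// fs.default.name stays an hdfs:// URI so the HDFS daemons can run, and this
// one job writes its output to S3 by using a fully-qualified s3:// URI.
JobConf job = new JobConf(MyJob.class);                      // MyJob is a placeholder
job.addInputPath(new Path("/user/lincoln/input"));           // resolved against HDFS (default fs)
job.setOutputPath(new Path("s3://SOME-BUCKET/job-output"));  // resolved against S3

// Credentials for the s3:// scheme, if they are not already in hadoop-site.xml:
job.set("fs.s3.awsAccessKeyId", "ACCESS_KEY_ID");
job.set("fs.s3.awsSecretAccessKey", "SECRET_ACCESS_KEY");

JobClient.runJob(job);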

-lincoln

--
lincolnritter.com



On Fri, Jul 11, 2008 at 4:21 AM, Tom White <[EMAIL PROTECTED]> wrote:
> On Thu, Jul 10, 2008 at 10:06 PM, Lincoln Ritter
> <[EMAIL PROTECTED]> wrote:
>> Thank you, Tom.
>>
>> Forgive me for being dense, but I don't understand your reply:
>>
>
> Sorry! I'll try to explain it better (see below).
>
>>
>> Do you mean that it is possible to use the Hadoop daemons with S3 but
>> the default filesystem must be HDFS?
>
> The HDFS daemons use the value of "fs.default.name" to set the
> namenode host and port, so if you set it to a s3 URI, you can't run
> the HDFS daemons. So in this case you would use the start-mapred.sh
> script instead of start-all.sh.
>
>> If that is the case, can I
>> specify the output filesystem on a per-job basis and can that be an S3
>> FS?
>
> Yes, that's exactly how you do it.
>
>>
>> Also, is there a particular reason to not allow S3 as the default FS?
>
> You can allow S3 as the default FS, it's just that then you can't run
> HDFS at all in this case. You would only do this if you don't want to
> use HDFS at all, for example, if you were running a MapReduce job
> which read from S3 and wrote to S3.
>
> It might be less confusing if the HDFS daemons didn't use
> fs.default.name to define the namenode host and port. Just like
> mapred.job.tracker defines the host and port for the jobtracker,
> dfs.namenode.address (or similar) could define the namenode. Would
> this be a good change to make?
>
> Tom
>


Re: Namenode Exceptions with S3

2008-07-10 Thread Lincoln Ritter
Thank you, Tom.

Forgive me for being dense, but I don't understand your reply:

> If you make the default filesystem S3 then you can't run HDFS daemons.
> If you want to run HDFS and use an S3 filesystem, you need to make the
> default filesystem a hdfs URI, and use s3 URIs to reference S3
> filesystems.

Do you mean that it is possible to use the Hadoop daemons with S3 but
the default filesystem must be HDFS?  If that is the case, can I
specify the output filesystem on a per-job basis and can that be an S3
FS?

Also, is there a particular reason to not allow S3 as the default FS?

Thanks so much for your time!

-lincoln

--
lincolnritter.com



On Thu, Jul 10, 2008 at 1:55 PM, Tom White <[EMAIL PROTECTED]> wrote:
>> I get (where the all-caps portions are the actual values...):
>>
>> 2008-07-01 19:05:17,540 ERROR org.apache.hadoop.dfs.NameNode:
>> java.lang.NumberFormatException: For input string:
>> "[EMAIL PROTECTED]"
>>at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>>at java.lang.Integer.parseInt(Integer.java:447)
>>at java.lang.Integer.parseInt(Integer.java:497)
>>at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:128)
>>at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:121)
>>at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:178)
>>at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:164)
>>at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)
>>at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857)
>>
>> These exceptions are taken from the namenode log.  The datanode logs
>> show the same exceptions.
>
> If you make the default filesystem S3 then you can't run HDFS daemons.
> If you want to run HDFS and use an S3 filesystem, you need to make the
> default filesystem a hdfs URI, and use s3 URIs to reference S3
> filesystems.
>
> Hope this helps.
>
> Tom
>


Re: slash in AWS Secret Key, WAS Re: Namenode Exceptions with S3

2008-07-09 Thread Lincoln Ritter
Thanks for the reply.

I've heard the "regenerate" suggestion before, but for organizations
with AWS keys in use all over the place this is a huge pain.  I think it
would be better to come up with a more robust solution for handling AWS info.
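(For the archive: the workaround Stuart describes below, credentials in
hadoop-site.xml plus plain s3://BUCKET/ URIs so the slash never appears in
a URI, looks like this; the values are placeholders:)

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>

<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_SECRET_KEY_EVEN_IF_IT_CONTAINS_A_SLASH</value>
</property>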

-lincoln

--
lincolnritter.com



On Wed, Jul 9, 2008 at 12:44 PM, Jimmy Lin <[EMAIL PROTECTED]> wrote:
> I've come across this problem before.  My simple solution was to
> regenerate new keys until I got one without a slash... ;)
>
> -Jimmy
>
>> I have Hadoop 0.17.1 and an AWS Secret Key that contains a slash ('/').
>>
>> With distcp, I found that using the URL format s3://ID:[EMAIL PROTECTED]/
>> did not work, even if I encoded the slash as "%2F".  I got
>> "org.jets3t.service.S3ServiceException: S3 HEAD request failed.
>> ResponseCode=403, ResponseMessage=Forbidden"
>>
>> When I put the AWS Secret Key in hadoop-site.xml and wrote the URL as
>> s3://BUCKET/ it worked.
>>
>> I have periods ('.') in my bucket name, that was not a problem.
>>
>> What's weird is that org.apache.hadoop.fs.s3.Jets3tFileSystemStore
>> uses java.net.URI, which should take care of unencoding the %2F.
>>
>> -Stuart
>>
>>
>> On Wed, Jul 9, 2008 at 1:41 PM, Lincoln Ritter
>> <[EMAIL PROTECTED]> wrote:
>>> So far, I've had no luck.
>>>
>>> Can anyone out there clarify the permissible characters/format for aws
>>> keys and bucket names?
>>>
>>> I haven't looked at the code here, but it seems strange to me that the
>>> same restrictions on host/port etc apply given that it's a totally
>>> different system.  I'd love to see exceptions thrown that are
>>> particular to the protocol/subsystem being employed.  The s3 'handler'
>>> (or whatever) might be nice enough to check for format violations and
>>> throw an appropriate exception, for instance.  It might URL-encode
>>> the secret key so that the user doesn't have to worry about this, or
>>> throw an exception notifying the user of a bad format.  Currently,
>>> apparent problems with my s3 settings are throwing exceptions that
>>> give no indication that the problem is actually with those settings.
>>>
>>> My mitigating strategy has been to change my configuration to use
>>> "instance-local" storage (/mnt).  I then copy the results out to s3
>>> using 'distcp'.  This is odd since distcp seems ok with my s3/aws
>>> info.
>>>
>>> I'm still unclear as to the permissible characters in bucket names and
>>> access keys.  I gather '/' is bad in the secret key and that '_' is
>>> bad for bucket names.  Thus far I have only been able to get buckets to
>>> work in distcp that have only letters in their names, but I haven't
>>> tested too extensively.
>>>
>>> For example, I'd love to use buckets like:
>>> 'com.organization.hdfs.purpose'.  This seems to fail.  Using
>>> 'comorganizationhdfspurpose' works but clearly that is less than
>>> optimal.
>>>
>>> Like I say, I haven't dug into the source yet, but it is curious that
>>> distcp seems to work (at least where s3 is the destination) and hadoop
>>> fails when s3 is used as its storage.
>>>
>>> Anyone who has dealt with these issues, please post!  It will help
>>> make the project better.
>>>
>>> -lincoln
>>>
>>> --
>>> lincolnritter.com
>>>
>>>
>>>
>>> On Wed, Jul 9, 2008 at 7:10 AM, slitz <[EMAIL PROTECTED]> wrote:
>>>> I'm having the exact same problem, any tip?
>>>>
>>>> slitz
>>>>
>>>> On Wed, Jul 2, 2008 at 12:34 AM, Lincoln Ritter
>>>> <[EMAIL PROTECTED]>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I am trying to use S3 with Hadoop 0.17.0 on EC2.  Using this style of
>>>>> configuration:
>>>>>
>>>>> <property>
>>>>>  <name>fs.default.name</name>
>>>>>  <value>s3://$HDFS_BUCKET</value>
>>>>> </property>
>>>>>
>>>>> <property>
>>>>>  <name>fs.s3.awsAccessKeyId</name>
>>>>>  <value>$AWS_ACCESS_KEY_ID</value>
>>>>> </property>
>>>>>
>>>>> <property>
>>>>>  <name>fs.s3.awsSecretAccessKey</name>
>>>>>  <value>$AWS_SECRET_ACCESS_KEY</value>
>>>>> </property>
>>>>>
>>>>> on startup of the cluster with the bu

Re: Namenode Exceptions with S3

2008-07-09 Thread Lincoln Ritter
So far, I've had no luck.

Can anyone out there clarify the permissible characters/format for aws
keys and bucket names?

I haven't looked at the code here, but it seems strange to me that the
same restrictions on host/port etc apply given that it's a totally
different system.  I'd love to see exceptions thrown that are
particular to the protocol/subsystem being employed.  The s3 'handler'
(or whatever) might be nice enough to check for format violations and
throw an appropriate exception, for instance.  It might URL-encode
the secret key so that the user doesn't have to worry about this, or
throw an exception notifying the user of a bad format.  Currently,
apparent problems with my s3 settings are throwing exceptions that
give no indication that the problem is actually with those settings.

My mitigating strategy has been to change my configuration to use
"instance-local" storage (/mnt).  I then copy the results out to s3
using 'distcp'.  This is odd since distcp seems ok with my s3/aws
info.

I'm still unclear as to the permissible characters in bucket names and
access keys.  I gather '/' is bad in the secret key and that '_' is
bad for bucket names.  Thus far I have only been able to get buckets to
work in distcp that have only letters in their names, but I haven't
tested too extensively.

For example, I'd love to use buckets like:
'com.organization.hdfs.purpose'.  This seems to fail.  Using
'comorganizationhdfspurpose' works but clearly that is less than
optimal.

Like I say, I haven't dug into the source yet, but it is curious that
distcp seems to work (at least where s3 is the destination) and hadoop
fails when s3 is used as its storage.

Anyone who has dealt with these issues, please post!  It will help
make the project better.

-lincoln

--
lincolnritter.com



On Wed, Jul 9, 2008 at 7:10 AM, slitz <[EMAIL PROTECTED]> wrote:
> I'm having the exact same problem, any tip?
>
> slitz
>
> On Wed, Jul 2, 2008 at 12:34 AM, Lincoln Ritter <[EMAIL PROTECTED]>
> wrote:
>
>> Hello,
>>
>> I am trying to use S3 with Hadoop 0.17.0 on EC2.  Using this style of
>> configuration:
>>
>> <property>
>>  <name>fs.default.name</name>
>>  <value>s3://$HDFS_BUCKET</value>
>> </property>
>>
>> <property>
>>  <name>fs.s3.awsAccessKeyId</name>
>>  <value>$AWS_ACCESS_KEY_ID</value>
>> </property>
>>
>> <property>
>>  <name>fs.s3.awsSecretAccessKey</name>
>>  <value>$AWS_SECRET_ACCESS_KEY</value>
>> </property>
>>
>> on startup of the cluster with the bucket having no non-alphabetic
>> characters, I get:
>>
>> 2008-07-01 16:10:49,171 ERROR org.apache.hadoop.dfs.NameNode:
>> java.lang.RuntimeException: Not a host:port pair: X
>>at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:121)
>>at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:121)
>>at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:178)
>>at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:164)
>>at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)
>>at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857)
>>
>> If I use this style of configuration:
>>
>> <property>
>>  <name>fs.default.name</name>
>>  <value>s3://$AWS_ACCESS_KEY:[EMAIL PROTECTED]</value>
>> </property>
>>
>> I get (where the all-caps portions are the actual values...):
>>
>> 2008-07-01 19:05:17,540 ERROR org.apache.hadoop.dfs.NameNode:
>> java.lang.NumberFormatException: For input string:
>> "[EMAIL PROTECTED]"
>>at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>>at java.lang.Integer.parseInt(Integer.java:447)
>>at java.lang.Integer.parseInt(Integer.java:497)
>>at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:128)
>>at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:121)
>>at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:178)
>>at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:164)
>>at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)
>>at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857)
>>
>> These exceptions are taken from the namenode log.  The datanode logs
>> show the same exceptions.
>>
>> Other than the above configuration changes, the configuration is
>> identical to that generated by the hadoop image creation script found
>> in the 0.17.0 distribution.
>>
>> Can anybody point me in the right direction here?
>>
>> -lincoln
>>
>> --
>> lincolnritter.com
>>
>


Namenode Exceptions with S3

2008-07-01 Thread Lincoln Ritter
Hello,

I am trying to use S3 with Hadoop 0.17.0 on EC2.  Using this style of
configuration:

<property>
  <name>fs.default.name</name>
  <value>s3://$HDFS_BUCKET</value>
</property>

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>$AWS_ACCESS_KEY_ID</value>
</property>

<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>$AWS_SECRET_ACCESS_KEY</value>
</property>

on startup of the cluster with the bucket having no non-alphabetic
characters, I get:

2008-07-01 16:10:49,171 ERROR org.apache.hadoop.dfs.NameNode:
java.lang.RuntimeException: Not a host:port pair: X
at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:121)
at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:121)
at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:178)
at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:164)
at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)
at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857)

If I use this style of configuration:

<property>
  <name>fs.default.name</name>
  <value>s3://$AWS_ACCESS_KEY:[EMAIL PROTECTED]</value>
</property>

I get (where the all-caps portions are the actual values...):

2008-07-01 19:05:17,540 ERROR org.apache.hadoop.dfs.NameNode:
java.lang.NumberFormatException: For input string:
"[EMAIL PROTECTED]"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
at java.lang.Integer.parseInt(Integer.java:447)
at java.lang.Integer.parseInt(Integer.java:497)
at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:128)
at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:121)
at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:178)
at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:164)
at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)
at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857)

These exceptions are taken from the namenode log.  The datanode logs
show the same exceptions.

Other than the above configuration changes, the configuration is
identical to that generated by the hadoop image creation script found
in the 0.17.0 distribution.

Can anybody point me in the right direction here?

-lincoln

--
lincolnritter.com


Re: Job History Logging Location

2008-06-25 Thread Lincoln Ritter
Hello again.

I answered my own question.

Setting 'hadoop.job.history.user.location' to 'logs' works fine.
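In hadoop-site.xml terms (or set per-job on the JobConf), that's just:

<property>
  <name>hadoop.job.history.user.location</name>
  <value>logs</value>
</property>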

Thanks anyway!

-lincoln

--
lincolnritter.com



On Wed, Jun 25, 2008 at 11:11 AM, Lincoln Ritter
<[EMAIL PROTECTED]> wrote:
> Greetings,
>
> I'm trying to get a handle on job history logging.  According to the
> documentation in 'hadoop-default.xml' the
> 'hadoop.job.history.user.location' determines where job history logs
> are written.  If not specified, these logs go into
> '<output path>/_logs/history'.  This can cause problems with
> applications that don't know about this convention.  It would also be
> nicer in my opinion to keep logs and data separate.
>
> It seems to me that a nice way to handle this would be to put logs in
> '/logs/<job-id>/history' or something.
>
> Can this be done?  Is there a need for the "job-id" folder?  If this
> can't be done, are there alternatives that work well?
>
> -lincoln
>
> --
> lincolnritter.com
>


Job History Logging Location

2008-06-25 Thread Lincoln Ritter
Greetings,

I'm trying to get a handle on job history logging.  According to the
documentation in 'hadoop-default.xml' the
'hadoop.job.history.user.location' determines where job history logs
are written.  If not specified, these logs go into
'<output path>/_logs/history'.  This can cause problems with
applications that don't know about this convention.  It would also be
nicer in my opinion to keep logs and data separate.

It seems to me that a nice way to handle this would be to put logs in
'/logs/<job-id>/history' or something.

Can this be done?  Is there a need for the "job-id" folder?  If this
can't be done, are there alternatives that work well?

-lincoln

--
lincolnritter.com

