Re: Hadoop cluster build, machine specs

2008-04-04 Thread Bradford Stephens
Greetings,

It really depends on your budget. What are you looking to spend? $5k?
$20k? Hadoop is about bringing the calculations to your data, so the
more machines you can have, the better.

In general, I'd recommend Dual-Core Opterons and 2-4 GB of RAM with a
SATA hard drive. My company just ordered five such machines from Dell
for Hadoop goodness, and I think the total came to around eight grand.

Another alternative is Amazon EC2 and S3, of course. It all depends on
what you want to do.


On Fri, Apr 4, 2008 at 5:27 PM, Ted Dziuba <[EMAIL PROTECTED]> wrote:
> Hi all,
>
>  I'm looking to build a small, 5-10 node cluster to run mostly CPU-bound
> Hadoop jobs.  I'm shying away from the 8-core behemoth type machines for
> cost reasons.  But what about dual core machines?  32 or 64 bits?
>
>  I'm still in the planning stages, so any advice would be greatly
> appreciated.
>
>  Thanks,
>
>  Ted
>


Hadoop cluster build, machine specs

2008-04-04 Thread Ted Dziuba

Hi all,

I'm looking to build a small, 5-10 node cluster to run mostly CPU-bound 
Hadoop jobs.  I'm shying away from the 8-core behemoth type machines for 
cost reasons.  But what about dual core machines?  32 or 64 bits?


I'm still in the planning stages, so any advice would be greatly 
appreciated.


Thanks,

Ted


Re: Hadoop: Multiple map reduce or some better way

2008-04-04 Thread Aayush Garg
Please share your inputs on my problem.

Thanks,


On Sat, Apr 5, 2008 at 1:10 AM, Robert Dempsey <[EMAIL PROTECTED]> wrote:

> Ted,
>
> It appears that Nutch hasn't been updated in a while (in Internet time at
> least). Do you know if it works with the latest versions of Hadoop? Thanks.
>
> - Robert Dempsey (new to the list)
>
>
> On Apr 4, 2008, at 5:36 PM, Ted Dunning wrote:
>
> >
> >
> > See Nutch.  See Nutch run.
> >
> > http://en.wikipedia.org/wiki/Nutch
> > http://lucene.apache.org/nutch/
> >
>


-- 
Aayush Garg,
Phone: +41 76 482 240


Re: Hadoop: Multiple map reduce or some better way

2008-04-04 Thread Robert Dempsey

Ted,

It appears that Nutch hasn't been updated in a while (in Internet time  
at least). Do you know if it works with the latest versions of Hadoop?  
Thanks.


- Robert Dempsey (new to the list)

On Apr 4, 2008, at 5:36 PM, Ted Dunning wrote:



See Nutch.  See Nutch run.

http://en.wikipedia.org/wiki/Nutch
http://lucene.apache.org/nutch/


RE: secondary namenode web interface

2008-04-04 Thread dhruba Borthakur
Your configuration is good. The secondary Namenode does not publish a
web interface. The "null pointer" message in the secondary Namenode log
is a harmless bug, but it should be fixed. It would be nice if you could
open a JIRA for it.

Thanks,
Dhruba


-Original Message-
From: Yuri Pradkin [mailto:[EMAIL PROTECTED] 
Sent: Friday, April 04, 2008 2:45 PM
To: core-user@hadoop.apache.org
Subject: Re: secondary namenode web interface

I'm re-posting this in the hope that someone can help.  Thanks!

On Wednesday 02 April 2008 01:29:45 pm Yuri Pradkin wrote:
> Hi,
>
> I'm running Hadoop (latest snapshot) on several machines and in our setup
> namenode and secondarynamenode are on different systems.  I see from the
> logs that the secondary namenode regularly checkpoints fs from the primary
> namenode.
>
> But when I go to the secondary namenode HTTP (dfs.secondary.http.address)
> in my browser I see something like this:
>
>   HTTP ERROR: 500
>   init
>   RequestURI=/dfshealth.jsp
>   Powered by Jetty://
>
> And in secondary's log I find these lines:
>
> 2008-04-02 11:26:25,357 WARN /: /dfshealth.jsp:
> java.lang.NullPointerException
> at org.apache.hadoop.dfs.dfshealth_jsp.<init>(dfshealth_jsp.java:21)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:539)
> at java.lang.Class.newInstance0(Class.java:373)
> at java.lang.Class.newInstance(Class.java:326)
> at org.mortbay.jetty.servlet.Holder.newInstance(Holder.java:199)
> at org.mortbay.jetty.servlet.ServletHolder.getServlet(ServletHolder.java:326)
> at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:405)
> at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
> at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
> at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
> at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
> at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
> at org.mortbay.http.HttpServer.service(HttpServer.java:954)
> at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
> at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
> at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
> at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
> at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
> at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
>
> Is something missing from my configuration?  Anybody else seen these?
>
> Thanks,
>
>   -Yuri




Re: secondary namenode web interface

2008-04-04 Thread Yuri Pradkin
I'm re-posting this in the hope that someone can help.  Thanks!

On Wednesday 02 April 2008 01:29:45 pm Yuri Pradkin wrote:
> Hi,
>
> I'm running Hadoop (latest snapshot) on several machines and in our setup
> namenode and secondarynamenode are on different systems.  I see from the
> logs that the secondary namenode regularly checkpoints fs from the primary
> namenode.
>
> But when I go to the secondary namenode HTTP (dfs.secondary.http.address)
> in my browser I see something like this:
>
>   HTTP ERROR: 500
>   init
>   RequestURI=/dfshealth.jsp
>   Powered by Jetty://
>
> And in secondary's log I find these lines:
>
> 2008-04-02 11:26:25,357 WARN /: /dfshealth.jsp:
> java.lang.NullPointerException
> at org.apache.hadoop.dfs.dfshealth_jsp.<init>(dfshealth_jsp.java:21)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:539)
> at java.lang.Class.newInstance0(Class.java:373)
> at java.lang.Class.newInstance(Class.java:326)
> at org.mortbay.jetty.servlet.Holder.newInstance(Holder.java:199)
> at org.mortbay.jetty.servlet.ServletHolder.getServlet(ServletHolder.java:326)
> at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:405)
> at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
> at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
> at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
> at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
> at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
> at org.mortbay.http.HttpServer.service(HttpServer.java:954)
> at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
> at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
> at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
> at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
> at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
> at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
>
> Is something missing from my configuration?  Anybody else seen these?
>
> Thanks,
>
>   -Yuri




Re: Streaming + custom input format

2008-04-04 Thread Yuri Pradkin
It does work for me.  I have to BOTH ship the extra jar using -file AND
include it in the classpath on the local system (by setting HADOOP_CLASSPATH).
I'm not sure what "nothing happened" means.  BTW, I'm using the 0.16.2
release.

On Friday 04 April 2008 10:19:54 am Francesco Tamberi wrote:
> I already tried that... nothing happened...
> Thank you,
> -- Francesco
>
> Ted Dunning wrote:
> > I saw that, but I don't know if it will put a jar into the classpath at
> > the other end.


Re: Hadoop: Multiple map reduce or some better way

2008-04-04 Thread Ted Dunning


See Nutch.  See Nutch run.

http://en.wikipedia.org/wiki/Nutch
http://lucene.apache.org/nutch/



On 4/4/08 1:22 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> I have not used lucene index ever before. I do not get how we build it with
> hadoop Map reduce. Basically what I was looking for like how to implement
> multilevel map/reduce for my mentioned problem.
> 
> 
> On Fri, Apr 4, 2008 at 7:23 PM, Ning Li <[EMAIL PROTECTED]> wrote:
> 
>> You can build Lucene indexes using Hadoop Map/Reduce. See the index
>> contrib package in the trunk. Or is it still not something you are
>> looking for?
>> 
>> Regards,
>> Ning
>> 
>> On 4/4/08, Aayush Garg <[EMAIL PROTECTED]> wrote:
>>> No, currently my requirement is to solve this problem by apache hadoop.
>> I am
>>> trying to build up this type of inverted index and then measure
>> performance
>>> criteria with respect to others.
>>> 
>>> Thanks,
>>> 
>>> 
>>> On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>> 
 
 Are you implementing this for instruction or production?
 
 If production, why not use Lucene?
 
 
 On 4/3/08 6:45 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
 
> HI  Amar , Theodore, Arun,
> 
> Thanks for your reply. Actaully I am new to hadoop so cant figure
>> out
 much.
> I have written following code for inverted index. This code maps
>> each
 word
> from the document to its document id.
> ex: apple file1 file123
> Main functions of the code are:-
> 
> public class HadoopProgram extends Configured implements Tool {
> public static class MapClass extends MapReduceBase
> implements Mapper {
> 
> private final static IntWritable one = new IntWritable(1);
> private Text word = new Text();
> private Text doc = new Text();
> private long numRecords=0;
> private String inputFile;
> 
>public void configure(JobConf job){
> System.out.println("Configure function is called");
> inputFile = job.get("map.input.file");
> System.out.println("In conf the input file is"+inputFile);
> }
> 
> 
> public void map(LongWritable key, Text value,
> OutputCollector output,
> Reporter reporter) throws IOException {
>   String line = value.toString();
>   StringTokenizer itr = new StringTokenizer(line);
>   doc.set(inputFile);
>   while (itr.hasMoreTokens()) {
> word.set(itr.nextToken());
> output.collect(word,doc);
>   }
>   if(++numRecords%4==0){
>System.out.println("Finished processing of input
 file"+inputFile);
>  }
> }
>   }
> 
>   /**
>* A reducer class that just emits the sum of the input values.
>*/
>   public static class Reduce extends MapReduceBase
> implements Reducer {
> 
>   // This works as K2, V2, K3, V3
> public void reduce(Text key, Iterator values,
>OutputCollector output,
>Reporter reporter) throws IOException {
>   int sum = 0;
>   Text dummy = new Text();
>   ArrayList IDs = new ArrayList();
>   String str;
> 
>   while (values.hasNext()) {
>  dummy = values.next();
>  str = dummy.toString();
>  IDs.add(str);
>}
>DocIDs dc = new DocIDs();
>dc.setListdocs(IDs);
>   output.collect(key,dc);
> }
>   }
> 
>  public int run(String[] args) throws Exception {
>   System.out.println("Run function is called");
> JobConf conf = new JobConf(getConf(), WordCount.class);
> conf.setJobName("wordcount");
> 
> // the keys are words (strings)
> conf.setOutputKeyClass(Text.class);
> 
> conf.setOutputValueClass(Text.class);
> 
> 
> conf.setMapperClass(MapClass.class);
> 
> conf.setReducerClass(Reduce.class);
> }
> 
> 
> Now I am getting output array from the reducer as:-
> word \root\test\test123, \root\test12
> 
> In the next stage I want to stop 'stop  words',  scrub words etc.
>> and
 like
> position of the word in the document. How would I apply multiple
>> maps or
> multilevel map reduce jobs programmatically? I guess I need to make
 another
> class or add some functions in it? I am not able to figure it out.
> Any pointers for these type of problems?
> 
> Thanks,
> Aayush
> 
> 
> On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <[EMAIL PROTECTED]>
 wrote:
> 
>> On Wed, 26 Mar 2008, Aayush Garg wrote:
>> 
>>> HI,
>>> I am developing the simple inverted index program frm the hadoop.
>> My
 map
>>> function has the output:
>>> 
>>> and the reducer ha

Re: on number of input files and split size

2008-04-04 Thread Prasan Ary
So it seems best for my application if I can somehow consolidate the smaller files
into a couple of large files.

  All of my files reside on S3, and I am using the 'distcp' command to copy them to
hdfs on EC2 before running a MR job. I was thinking it would be nice if I could
modify distcp such that each EC2 image running 'distcp' on the EC2 cluster will
concatenate its input files into a single file, so that at the end of the copy
process we will have as many files as there are machines in the cluster.

  Any thoughts on how I should proceed with this, or on whether this is a good idea
at all?
   
  

Ted Dunning <[EMAIL PROTECTED]> wrote:
  
The split will depend entirely on the input format that you use and the
files that you have. In your case, you have lots of very small files so the
limiting factor will almost certainly be the number of files. Thus, you
will have 1000 splits (one per file).

Your performance, btw, will likely be pretty poor with so many small files.
Can you consolidate them? 100MB of data should probably be in no more than
a few files if you want good performance. At that, most kinds of processing
will be completely dominated by job startup time. If your jobs are I/O
bound, they will be able to read 100MB of data in a just a few seconds at
most. Startup time for a hadoop job is typically 10 seconds or more.


On 4/4/08 12:58 PM, "Prasan Ary" wrote:

> I have a question on how input files are split before they are given out to
> Map functions.
> Say I have an input directory containing 1000 files whose total size is 100
> MB, and I have 10 machines in my cluster and I have configured 10
> mapred.map.tasks in hadoop-site.xml.
> 
> 1. With this configuration, do we have a way to know what size each split
> will be of?
> 2. Does split size depend on how many files there are in the input
> directory? What if I have only 10 files in input directory, but the total size
> of all these files is still 100 MB? Will it affect split size?
> 
> Thanks.
> 
> 



   

S3 Exception

2008-04-04 Thread Craig Blake
Is there any additional configuration needed to run against S3 besides  
these instructions?


http://wiki.apache.org/hadoop/AmazonS3

Following the instructions on that page, when I try to run "start- 
dfs.sh" I see the following exception in the logs:


2008-04-04 17:03:31,345 ERROR org.apache.hadoop.dfs.NameNode:  
java.lang.IllegalArgumentException: port out of range:-1

at java.net.InetSocketAddress.(InetSocketAddress.java:118)
at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:125)
at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:119)
at org.apache.hadoop.dfs.NameNode.(NameNode.java:176)
at org.apache.hadoop.dfs.NameNode.(NameNode.java:162)
at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)

The node fails to start and exits after this exception is thrown.

The bucket name I am using is of the format 'myuniqueprefix-hadoop'.   
I am using Hadoop 0.16.2 on OSX.  I didn't see any Jira issues for  
this or any resolution in previous email threads.  Is there a  
workaround for this, or have I missed some necessary step in  
configuration?


Thanks,
Craig B


Re: Hadoop: Multiple map reduce or some better way

2008-04-04 Thread Aayush Garg
Hi,

I have never used a Lucene index before, and I do not see how we build one with
Hadoop MapReduce. Basically, what I was looking for is how to implement
multilevel map/reduce for the problem I mentioned.


On Fri, Apr 4, 2008 at 7:23 PM, Ning Li <[EMAIL PROTECTED]> wrote:

> You can build Lucene indexes using Hadoop Map/Reduce. See the index
> contrib package in the trunk. Or is it still not something you are
> looking for?
>
> Regards,
> Ning
>
> On 4/4/08, Aayush Garg <[EMAIL PROTECTED]> wrote:
> > No, currently my requirement is to solve this problem by apache hadoop.
> I am
> > trying to build up this type of inverted index and then measure
> performance
> > criteria with respect to others.
> >
> > Thanks,
> >
> >
> > On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >
> > >
> > > Are you implementing this for instruction or production?
> > >
> > > If production, why not use Lucene?
> > >
> > >
> > > On 4/3/08 6:45 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
> > >
> > > > HI  Amar , Theodore, Arun,
> > > >
> > > > Thanks for your reply. Actaully I am new to hadoop so cant figure
> out
> > > much.
> > > > I have written following code for inverted index. This code maps
> each
> > > word
> > > > from the document to its document id.
> > > > ex: apple file1 file123
> > > > Main functions of the code are:-
> > > >
> > > > public class HadoopProgram extends Configured implements Tool {
> > > > public static class MapClass extends MapReduceBase
> > > > implements Mapper {
> > > >
> > > > private final static IntWritable one = new IntWritable(1);
> > > > private Text word = new Text();
> > > > private Text doc = new Text();
> > > > private long numRecords=0;
> > > > private String inputFile;
> > > >
> > > >public void configure(JobConf job){
> > > > System.out.println("Configure function is called");
> > > > inputFile = job.get("map.input.file");
> > > > System.out.println("In conf the input file is"+inputFile);
> > > > }
> > > >
> > > >
> > > > public void map(LongWritable key, Text value,
> > > > OutputCollector output,
> > > > Reporter reporter) throws IOException {
> > > >   String line = value.toString();
> > > >   StringTokenizer itr = new StringTokenizer(line);
> > > >   doc.set(inputFile);
> > > >   while (itr.hasMoreTokens()) {
> > > > word.set(itr.nextToken());
> > > > output.collect(word,doc);
> > > >   }
> > > >   if(++numRecords%4==0){
> > > >System.out.println("Finished processing of input
> > > file"+inputFile);
> > > >  }
> > > > }
> > > >   }
> > > >
> > > >   /**
> > > >* A reducer class that just emits the sum of the input values.
> > > >*/
> > > >   public static class Reduce extends MapReduceBase
> > > > implements Reducer {
> > > >
> > > >   // This works as K2, V2, K3, V3
> > > > public void reduce(Text key, Iterator values,
> > > >OutputCollector output,
> > > >Reporter reporter) throws IOException {
> > > >   int sum = 0;
> > > >   Text dummy = new Text();
> > > >   ArrayList IDs = new ArrayList();
> > > >   String str;
> > > >
> > > >   while (values.hasNext()) {
> > > >  dummy = values.next();
> > > >  str = dummy.toString();
> > > >  IDs.add(str);
> > > >}
> > > >DocIDs dc = new DocIDs();
> > > >dc.setListdocs(IDs);
> > > >   output.collect(key,dc);
> > > > }
> > > >   }
> > > >
> > > >  public int run(String[] args) throws Exception {
> > > >   System.out.println("Run function is called");
> > > > JobConf conf = new JobConf(getConf(), WordCount.class);
> > > > conf.setJobName("wordcount");
> > > >
> > > > // the keys are words (strings)
> > > > conf.setOutputKeyClass(Text.class);
> > > >
> > > > conf.setOutputValueClass(Text.class);
> > > >
> > > >
> > > > conf.setMapperClass(MapClass.class);
> > > >
> > > > conf.setReducerClass(Reduce.class);
> > > > }
> > > >
> > > >
> > > > Now I am getting output array from the reducer as:-
> > > > word \root\test\test123, \root\test12
> > > >
> > > > In the next stage I want to stop 'stop  words',  scrub words etc.
> and
> > > like
> > > > position of the word in the document. How would I apply multiple
> maps or
> > > > multilevel map reduce jobs programmatically? I guess I need to make
> > > another
> > > > class or add some functions in it? I am not able to figure it out.
> > > > Any pointers for these type of problems?
> > > >
> > > > Thanks,
> > > > Aayush
> > > >
> > > >
> > > > On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <[EMAIL PROTECTED]>
> > > wrote:
> > > >
> > > >> On Wed, 26 Mar 2008, Aayush Garg wrote:
> > > >>
> > > >>> HI,
> > > >>> I am developing the simple inverted index program frm the hadoop.
> My
> > > map
> > > >>> function has the output:
> > > >>> 
> > > >>> and the reducer has

Re: more noob questions--how/when is data 'distributed' across a cluster?

2008-04-04 Thread Ted Dunning

I should add that systems like Pig and JAQL aim to satisfy your needs very
nicely.  They may or may not be ready for your needs, but they aren't
terribly far away.

Also, you should consider whether it is better for you to have a system that
is considered "industry standard" (aka fully relational) or "somewhat
experimental or avant-garde".  Different situations could force one answer
or the other.


On 4/4/08 11:48 AM, "Paul Danese" <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> Currently I have a large (for me) amount of data stored in a relational
> database (3 tables: each with 2 - 10 million related records. This is an
> oversimplification, but for clarity it's close enough).
> 
> There is a relatively simple Object-relational Mapping (ORM) to my
> database:  Specifically, my parent Object is called "Accident".
> Accidents can have 1 or more Report objects (has many).
> Reports can have 1 or more Outcome objects (has many).
> 
> Each of these Objects maps to a specific table in my RDBMS w/ foreign keys
> 'connecting' records between tables.
> 
> I run searches against this database (using Lucene) and this works quite
> well as long as I return only *subsets* of the *total* result-set at any one
> time.
> e.g. I may have 25,000 hits ("Accidents") that meet my threshold Lucene
> score, but as long as I only query the database for 50 Accident "objects" at
> any one time, the response time is great.
> 
> The 'problem' is that I'd also like to use those 25,000 Accidents to
> generate an electronic report as **quickly as possible**
> (right now it takes about 30 minutes to collect all 25,000 hits from the
> database, extract the relevant fields and construct the actual report).
> Most of this 30 minutes is spent hitting the database and
> processing/extracting the relevant data (generating the report is rather
> fast once all the data are properly formatted).
> 
> So...at my naive level, this seems like a decent job for hadoop.
> ***QUESTION 1: Is this an accurate belief?***
> 
> i.e., I have a semi-large collection of key/value pairs (25,000 Accident IDs
> would be the keys, and 25,000 Accident objects would be values)
> 
> These object/value pairs are "mapped" on a cluster, extracting the relevant
> data from each object.
> The mapping then releases a new set of "key/value" pairs (in this case,
> emitted keys are one of three categories (accident, report, outcome) and the
> values are arrays of accident, report and outcome data that will go into the
> report).
> 
> These emitted key/value pairs are then "reduced" and resulting reduced
> collections are used to build the report.
> 
> ***QUESTION 2:  If the answer to Q1 is "yes", how does one typically "move"
> data from a rdbms to something like HDFS/HBase?***
> ***QUESTION 3:  Am I right in thinking that my HBase data are going to be
> denormalized relative to my RDBMS?***
> ***QUESTION 4:  How are the data within an HBase database *actually*
> distributed amongst nodes?  i.e. is the distribution done automatically upon
> creating the db (as long as the cluster already exists?)  Or do you have to
> issue some type of command that says "okay...here's the HBase db, distribute
> it to nodes a - z"***
> ***QUESTION 5:  Or is this whole problem something better addressed by some
> type of high-performance rdbms cluster?***
> ***QUESTION 6:  Is there a detailed (step by step) tutorial on how to use
> HBase w/ Hadoop?***
> 
> 
> Anyway, apologies if this is the 1000th time this has been answered and
> thank you for any insight!



Re: more noob questions--how/when is data 'distributed' across a cluster?

2008-04-04 Thread Ted Dunning



On 4/4/08 11:48 AM, "Paul Danese" <[EMAIL PROTECTED]> wrote:
> [ ... Extract and report on 25,000 out of 10^6 records ...]
> 
> So...at my naive level, this seems like a decent job for hadoop.
> ***QUESTION 1: Is this an accurate belief?***

Sounds just right.

On 10 loser machines, it is feasible to expect to be able to scan 100MB per
second, so if your records (with associated data) total about 1GB, it should
take about 10 seconds to pass over your data.  With allowances for system
reality you should be able to do something interesting in a minute or so.

> [ ... Map does projection and object tagging, reduce does reporting ...]]]

Your process outline looks fine.

> 
> ***QUESTION 2:  If the answer to Q1 is "yes", how does one typically "move"
> data from a rdbms to something like HDFS/HBase?***

There are copy commands.  Typically, you partition your data by date of
insertion and dump new records into new files.  Occasionally, you might
merge older records to limit the number of total files.
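
A toy sketch of that kind of dump, appending the day's new rows into a per-date
directory on HDFS, might look like the following (the path layout, the class name,
and the assumption that the rows were already pulled out of the RDBMS as
tab-separated strings are all made up for illustration):

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical exporter: write today's new rows (already fetched elsewhere as
// tab-separated strings) into a dated directory on HDFS, one new file per run.
public class DailyDumper {
  public static void dump(List<String> rowsFetchedFromDatabase) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    String day = new SimpleDateFormat("yyyy-MM-dd").format(new Date());
    // One directory per insertion date; a fresh part file for every export run.
    Path out = new Path("/warehouse/accidents/" + day + "/part-" + System.currentTimeMillis());
    BufferedWriter w = new BufferedWriter(new OutputStreamWriter(fs.create(out), "UTF-8"));
    try {
      for (String row : rowsFetchedFromDatabase) {
        w.write(row);
        w.newLine();
      }
    } finally {
      w.close();
    }
  }
}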

I don't see any need for Hbase here, but since the two things that Hbase
does really well are insertion and table scan, it would be pretty natural to
use.  You should check with the hbase guys to see how long it will take to
produce the output you want, but I would expect that you could get your
25000 rows faster than a raw scan using hadoop alone.

> ***QUESTION 3:  Am I right in thinking that my HBase data are going to be
> denormalized relative to my RDBMS?***

Very likely.

> ***QUESTION 5:  Or is this whole problem something better addressed by some
> type of high-performance rdbms cluster?***

This would be a very modest sized Oracle data set, but I would guess that
the costs would make Hadoop preferable.  It isn't even all that large for
mySQL.  The existence of very nice reporting software for either of these
could tip the balance the other way, however.

> ***QUESTION 6:  Is there a detailed (step by step) tutorial on how to use
> HBase w/ Hadoop?***

I think  you could start with hadoop alone.




Re: on number of input files and split size

2008-04-04 Thread Ted Dunning

The split will depend entirely on the input format that you use and the
files that you have.  In your case, you have lots of very small files so the
limiting factor will almost certainly be the number of files.  Thus, you
will have 1000 splits (one per file).

Your performance, btw, will likely be pretty poor with so many small files.
Can you consolidate them?  100MB of data should probably be in no more than
a few files if you want good performance.  At that, most kinds of processing
will be completely dominated by job startup time.  If your jobs are I/O
bound, they will be able to read 100MB of data in a just a few seconds at
most.  Startup time for a hadoop job is typically 10 seconds or more.
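
As a rough illustration of the consolidation, one option is to pack the small files
into a single SequenceFile of (filename, contents) records once they are on HDFS and
run the job over that instead. A minimal sketch, assuming a FileSystem/SequenceFile
API somewhat newer than the 0.16 releases in this thread (the class name and argument
layout are invented for the example, and it only suits files small enough to buffer
in memory):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical packer: roll every small file in a directory into one SequenceFile
// of (filename, contents) records, so a later job reads one large file instead of
// a thousand tiny ones.
public class SmallFilePacker {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inDir = new Path(args[0]);    // directory holding the small files
    Path outFile = new Path(args[1]);  // single packed output file
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, outFile, Text.class, BytesWritable.class);
    try {
      for (FileStatus stat : fs.listStatus(inDir)) {
        if (stat.isDirectory()) continue;
        byte[] buf = new byte[(int) stat.getLen()];   // fine for small files only
        FSDataInputStream in = fs.open(stat.getPath());
        try {
          in.readFully(0, buf);
        } finally {
          in.close();
        }
        writer.append(new Text(stat.getPath().getName()), new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
  }
}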


On 4/4/08 12:58 PM, "Prasan Ary" <[EMAIL PROTECTED]> wrote:

> I have a question on how input files are split before they are given out to
> Map functions.
>   Say I have an input directory containing  1000 files whose total size is 100
> MB, and I have 10 machines in my cluster and I have configured 10
> mapred.map.tasks in hadoop-site.xml.
>
>   1. With this configuration, do we have a way to know what size each split
> will be of?
>   2. Does split size depend on how many files there are in the input
> directory? What if I have only 10 files in input directory, but the total size
> of all these files is still 100 MB? Will it affect split size?
>
>   Thanks.
> 
>



Re: distcp fails when copying from s3 to hdfs

2008-04-04 Thread s29752-hadoopuser
Your distcp command looks correct.  distcp may have created some log files
(e.g. inside /_distcp_logs_5vzva5 from your previous email).  Could you check
the logs and see whether there are error messages?

If you could send me the distcp output and the logs, I may be able to find out
the problem.  (Remember to remove the id:secret.  :)

Nicholas


- Original Message 
From: Siddhartha Reddy <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Friday, April 4, 2008 12:23:17 PM
Subject: Re: distcp fails when copying from s3 to hdfs

I am sorry, that was a mistype in my mail. The second command was (please
note the / at the end):

bin/hadoop fs -fs s3://id:[EMAIL PROTECTED] -ls /


I guess you are right, Nicholas. The s3://id:[EMAIL PROTECTED]/file.txt
indeed does not seem to be there. But the earlier distcp command to copy the
file to S3 finished without errors. Once again, the command I am using to
copy the file to S3 is:

bin/hadoop distcp file.txt s3://id:[EMAIL PROTECTED]/file.txt

Am I doing anything wrong here?

Thanks,
Siddhartha


On Fri, Apr 4, 2008 at 11:38 PM, <[EMAIL PROTECTED]> wrote:

> >To check that the file actually exists on S3, I tried the following
> commands:
> >
> >bin/hadoop fs -fs s3://id:[EMAIL PROTECTED] -ls
> >bin/hadoop fs -fs s3://id:[EMAIL PROTECTED] -ls
> >
> >The first returned nothing, while the second returned the following:
> >
> >Found 1 items
> >/_distcp_logs_5vzva5   1969-12-31 19:00rwxrwxrwx
>
>
> Are the first and the second commands the same?  (why they return
> different results?)  It seems that distcp is right:
> s3://id:[EMAIL PROTECTED]/file.txt indeed does not exist from the output
> of your the second command.
>
> Nicholas
>
>


-- 
http://sids.in
"If you are not having fun, you are not doing it right."





on number of input files and split size

2008-04-04 Thread Prasan Ary
I have a question on how input files are split before they are given out to Map 
functions.
  Say I have an input directory containing  1000 files whose total size is 100 
MB, and I have 10 machines in my cluster and I have configured 10 
mapred.map.tasks in hadoop-site.xml.
   
  1. With this configuration, do we have a way to know what size each split 
will be of?
  2. Does split size depend on how many files there are in the input directory? 
What if I have only 10 files in input directory, but the total size of all 
these files is still 100 MB? Will it affect split size?
   
  Thanks.

   

Re: Error msg: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration

2008-04-04 Thread Anisio Mendes Lacerda
Thanks a lot for the help. Just for the record, I would like to post which
environment variables I set:

export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun-1.5.0.13
export OS_NAME=linux
export OS_ARCH=i386
export LIBHDFS_BUILD_DIR=/mnt/hd1/hadoop/hadoop-0.14.4/libhdfs
export SHLIB_VERSION=1

export HADOOP_HOME=/mnt/hd1/hadoop/hadoop-0.14.4
export HADOOP_CONF_DIR=/mnt/hd1/hadoop/hadoop-0.14.4/conf
export HADOOP_LOG_DIR=/mnt/hd1/hadoop/hadoop-0.14.4/logs

export
LD_LIBRARY_PATH=/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/jre/lib/i386/server:/mnt/hd1/hadoop/hadoop-0.14.4/libhdfs

export
CLASSPATH=/mnt/hd1/hadoop/hadoop-0.14.4/hadoop-0.14.4-core.jar:/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/jre/lib/i386:/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/jre/lib/i386/server:/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/jre/lib/i386/client:/usr/share/java:/usr/share/java/commons-logging-api.jar:/usr/share/java/:/usr/share/java/commons-logging.jar:/usr/share/java/log4j-1.2.jar

export
HADOOP_CLASSPATH=/mnt/hd1/hadoop/hadoop-0.14.4/hadoop-0.14.4-core.jar:/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/jre/lib/i386:/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/jre/lib/i386/server:/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/jre/lib/i386/client:/usr/share/java:/usr/share/java/commons-logging-api.jar:/usr/share/java/:/usr/share/java/commons-logging.jar:/usr/share/java/log4j-1.2.jar

See you,

Anisio


On Thu, Apr 3, 2008 at 8:51 AM, Peeyush Bishnoi <[EMAIL PROTECTED]>
wrote:

> Hello ,
>
> As this problem is related to CLASSPATh of hadoop , so just set the
> HADOOP_CLASSPATH or CLASSPATH with hadoop core jar
>
> ---
> Peeyush
>
>
> On Wed, 2008-04-02 at 13:51 -0300, Anisio Mendes Lacerda wrote:
>
> > Hi,
> >
> > me and my coleagues are implementing a small search engine in my
> University
> > Laboratory,
> > and we would like to use Hadoop as file system.
> >
> > For now we are having troubles in running the following simple code
> example:
> >
> > 
> > #include "hdfs.h"
> > int main(int argc, char **argv) {
> > hdfsFS fs = hdfsConnect("apolo.latin.dcc.ufmg.br", 51070);
> > if(!fs) {
> > fprintf(stderr, "Oops! Failed to connect to hdfs!\n");
> > exit(-1);
> > }
> > int result = hdfsDisconnect(fs);
> > if(!result) {
> > fprintf(stderr, "Oops! Failed to connect to hdfs!\n");
> > exit(-1);
> > }
> > }
> > 
> >
> > The error msg is:
> >
> > Exception in thread "main" java.lang.NoClassDefFoundError:
> > org/apache/hadoop/conf/Configuration
> >
> > 
> >
> > We configured the following enviroment variables:
> >
> > export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun-1.5.0.13
> > export OS_NAME=linux
> > export OS_ARCH=i386
> > export LIBHDFS_BUILD_DIR=/mnt/hd1/hadoop/hadoop-0.14.4/libhdfs
> > export SHLIB_VERSION=1
> >
> > export HADOOP_HOME=/mnt/hd1/hadoop/hadoop-0.14.4
> > export HADOOP_CONF_DIR=/mnt/hd1/hadoop/hadoop-0.14.4/conf
> > export HADOOP_LOG_DIR=/mnt/hd1/hadoop/hadoop-0.14.4/logs
> >
> > The following commands were used to compile the codes:
> >
> > In directory: hadoop-0.14.4/src/c++/libhdfs
> >
> > 1 - make all
> > 2 - gcc -I/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/include/ -c
> my_hdfs_test.c
> > 3 - gcc my_hdfs_test.o -I/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/include/
> > -L/mnt/hd1/hadoop/hadoop-0.14.4/libhdfs -lhdfs
> > -L/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/jre/lib/i386/server -ljvm  -o
> > my_hdfs_test
> >
> > Obs:
> >
> > The hadoop file system seems to be ok once we can run commands like
> this:
> >
> > [EMAIL PROTECTED]:/mnt/hd1/hadoop/hadoop-0.14.4$ bin/hadoop dfs -ls
> > Found 0 items
> >
> >
> >
> >
> >
> >
>



-- 
[]s

Anisio Mendes Lacerda


Re: distcp fails when copying from s3 to hdfs

2008-04-04 Thread Siddhartha Reddy
I am sorry, that was a mistype in my mail. The second command was (please
note the / at the end):

bin/hadoop fs -fs s3://id:[EMAIL PROTECTED] -ls /


I guess you are right, Nicholas. The s3://id:[EMAIL PROTECTED]/file.txt
indeed does not seem to be there. But the earlier distcp command to copy the
file to S3 finished without errors. Once again, the command I am using to
copy the file to S3 is:

bin/hadoop distcp file.txt s3://id:[EMAIL PROTECTED]/file.txt

Am I doing anything wrong here?

Thanks,
Siddhartha


On Fri, Apr 4, 2008 at 11:38 PM, <[EMAIL PROTECTED]> wrote:

> >To check that the file actually exists on S3, I tried the following
> commands:
> >
> >bin/hadoop fs -fs s3://id:[EMAIL PROTECTED] -ls
> >bin/hadoop fs -fs s3://id:[EMAIL PROTECTED] -ls
> >
> >The first returned nothing, while the second returned the following:
> >
> >Found 1 items
> >/_distcp_logs_5vzva5   1969-12-31 19:00rwxrwxrwx
>
>
> Are the first and the second commands the same?  (why they return
> different results?)  It seems that distcp is right:
> s3://id:[EMAIL PROTECTED]/file.txt indeed does not exist from the output
> of your the second command.
>
> Nicholas
>
>


-- 
http://sids.in
"If you are not having fun, you are not doing it right."


more noob questions--how/when is data 'distributed' across a cluster?

2008-04-04 Thread Paul Danese
Hi,

Currently I have a large (for me) amount of data stored in a relational
database (3 tables: each with 2 - 10 million related records. This is an
oversimplification, but for clarity it's close enough).

There is a relatively simple Object-relational Mapping (ORM) to my
database:  Specifically, my parent Object is called "Accident".
Accidents can have 1 or more Report objects (has many).
Reports can have 1 or more Outcome objects (has many).

Each of these Objects maps to a specific table in my RDBMS w/ foreign keys
'connecting' records between tables.

I run searches against this database (using Lucene) and this works quite
well as long as I return only *subsets* of the *total* result-set at any one
time.
e.g. I may have 25,000 hits ("Accidents") that meet my threshold Lucene
score, but as long as I only query the database for 50 Accident "objects" at
any one time, the response time is great.

The 'problem' is that I'd also like to use those 25,000 Accidents to
generate an electronic report as **quickly as possible**
(right now it takes about 30 minutes to collect all 25,000 hits from the
database, extract the relevant fields and construct the actual report).
Most of this 30 minutes is spent hitting the database and
processing/extracting the relevant data (generating the report is rather
fast once all the data are properly formatted).

So...at my naive level, this seems like a decent job for hadoop.
***QUESTION 1: Is this an accurate belief?***

i.e., I have a semi-large collection of key/value pairs (25,000 Accident IDs
would be the keys, and 25,000 Accident objects would be values)

These object/value pairs are "mapped" on a cluster, extracting the relevant
data from each object.
The mapping then releases a new set of "key/value" pairs (in this case,
emitted keys are one of three categories (accident, report, outcome) and the
values are arrays of accident, report and outcome data that will go into the
report).

These emitted key/value pairs are then "reduced" and resulting reduced
collections are used to build the report.

***QUESTION 2:  If the answer to Q1 is "yes", how does one typically "move"
data from a rdbms to something like HDFS/HBase?***
***QUESTION 3:  Am I right in thinking that my HBase data are going to be
denormalized relative to my RDBMS?***
***QUESTION 4:  How are the data within an HBase database *actually*
distributed amongst nodes?  i.e. is the distribution done automatically upon
creating the db (as long as the cluster already exists?)  Or do you have to
issue some type of command that says "okay...here's the HBase db, distribute
it to nodes a - z"***
***QUESTION 5:  Or is this whole problem something better addressed by some
type of high-performance rdbms cluster?***
***QUESTION 6:  Is there a detailed (step by step) tutorial on how to use
HBase w/ Hadoop?***


Anyway, apologies if this is the 1000th time this has been answered and
thank you for any insight!


Re: distcp fails when copying from s3 to hdfs

2008-04-04 Thread s29752-hadoopuser
>To check that the file actually exists on S3, I tried the following commands:
>
>bin/hadoop fs -fs s3://id:[EMAIL PROTECTED] -ls
>bin/hadoop fs -fs s3://id:[EMAIL PROTECTED] -ls
>
>The first returned nothing, while the second returned the following:
>
>Found 1 items
>/_distcp_logs_5vzva5   1969-12-31 19:00rwxrwxrwx


Are the first and the second commands the same?  (Why do they return different
results?)  It seems that distcp is right: judging from the output of your second
command, s3://id:[EMAIL PROTECTED]/file.txt indeed does not exist.

Nicholas



Re: Streaming + custom input format

2008-04-04 Thread Ted Dunning



On 4/4/08 10:18 AM, "Francesco Tamberi" <[EMAIL PROTECTED]> wrote:

> Thank for your fast reply!
> 
> Ted Dunning wrote:
>> Take a looks at the way that the text input format moves to the next line
>> after a split point.
>> 
>>   
> I'm not sure I understand... is my way correct, or are you suggesting
> another one?

I am not sure if I was suggesting something different, but I think it was.

It sounded like you were going to find good split points in the getSplits
method.

The TextInputFormat doesn't do try to be so clever since that would involve
(serialized) reading of parts of the file.  Instead, it picks the break
points *WITHOUT* reference to the contents of the files.  Then, when the
mapper lights up, the input format jumps to the assigned point, reads the
remainder of the line at that point and then starts sending full lines to
the mapper, continuing until it hits the end of the file OR passes the
beginning of the next split.  This means that it may read additional data
after the assigned end point, but that extra data is guaranteed to be
ignored by the input format in charge of reading that split.

This is a very clever and simple solution to the problem that depends only
on being able to find a boundary between records.  If you can do that, then
you are golden.
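
A rough sketch of that pattern, using newline-delimited records for concreteness (this
is not Hadoop's own LineRecordReader, which handles compression and a few more edge
cases, and it leans on org.apache.hadoop.util.LineReader from releases newer than the
0.16 line discussed here):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.util.LineReader;

// Split starts are arbitrary byte offsets.  This reader discards the partial record
// at the front of its split and keeps reading until it has passed the split end, so
// every record is read exactly once across all splits.
public class BoundaryAwareRecordReader implements RecordReader<LongWritable, Text> {
  private final long start;
  private final long end;
  private long pos;
  private final FSDataInputStream in;
  private final LineReader reader;

  public BoundaryAwareRecordReader(JobConf conf, FileSplit split) throws IOException {
    start = split.getStart();
    end = start + split.getLength();
    FileSystem fs = split.getPath().getFileSystem(conf);
    in = fs.open(split.getPath());
    in.seek(start);
    reader = new LineReader(in, conf);
    pos = start;
    if (start != 0) {
      // Not at the beginning of the file: skip the record that the previous
      // split's reader is responsible for (it reads past its own end point).
      pos += reader.readLine(new Text());
    }
  }

  public boolean next(LongWritable key, Text value) throws IOException {
    if (pos > end) {
      return false;                       // we have passed the split boundary
    }
    key.set(pos);
    int bytes = reader.readLine(value);   // may read beyond 'end'; that is fine
    if (bytes == 0) {
      return false;                       // end of file
    }
    pos += bytes;
    return true;
  }

  public LongWritable createKey() { return new LongWritable(); }
  public Text createValue() { return new Text(); }
  public long getPos() { return pos; }
  public void close() throws IOException { in.close(); }
  public float getProgress() {
    return end == start ? 0.0f : Math.min(1.0f, (pos - start) / (float) (end - start));
  }
}

An InputFormat would hand this out from getRecordReader(); getSplits() can then stay
completely ignorant of record boundaries, which is exactly the point above.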



Re: Hadoop: Multiple map reduce or some better way

2008-04-04 Thread Ning Li
You can build Lucene indexes using Hadoop Map/Reduce. See the index
contrib package in the trunk. Or is it still not something you are
looking for?

Regards,
Ning

On 4/4/08, Aayush Garg <[EMAIL PROTECTED]> wrote:
> No, currently my requirement is to solve this problem by apache hadoop. I am
> trying to build up this type of inverted index and then measure performance
> criteria with respect to others.
>
> Thanks,
>
>
> On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> >
> > Are you implementing this for instruction or production?
> >
> > If production, why not use Lucene?
> >
> >
> > On 4/3/08 6:45 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
> >
> > > HI  Amar , Theodore, Arun,
> > >
> > > Thanks for your reply. Actaully I am new to hadoop so cant figure out
> > much.
> > > I have written following code for inverted index. This code maps each
> > word
> > > from the document to its document id.
> > > ex: apple file1 file123
> > > Main functions of the code are:-
> > >
> > > public class HadoopProgram extends Configured implements Tool {
> > > public static class MapClass extends MapReduceBase
> > > implements Mapper {
> > >
> > > private final static IntWritable one = new IntWritable(1);
> > > private Text word = new Text();
> > > private Text doc = new Text();
> > > private long numRecords=0;
> > > private String inputFile;
> > >
> > >public void configure(JobConf job){
> > > System.out.println("Configure function is called");
> > > inputFile = job.get("map.input.file");
> > > System.out.println("In conf the input file is"+inputFile);
> > > }
> > >
> > >
> > > public void map(LongWritable key, Text value,
> > > OutputCollector output,
> > > Reporter reporter) throws IOException {
> > >   String line = value.toString();
> > >   StringTokenizer itr = new StringTokenizer(line);
> > >   doc.set(inputFile);
> > >   while (itr.hasMoreTokens()) {
> > > word.set(itr.nextToken());
> > > output.collect(word,doc);
> > >   }
> > >   if(++numRecords%4==0){
> > >System.out.println("Finished processing of input
> > file"+inputFile);
> > >  }
> > > }
> > >   }
> > >
> > >   /**
> > >* A reducer class that just emits the sum of the input values.
> > >*/
> > >   public static class Reduce extends MapReduceBase
> > > implements Reducer {
> > >
> > >   // This works as K2, V2, K3, V3
> > > public void reduce(Text key, Iterator values,
> > >OutputCollector output,
> > >Reporter reporter) throws IOException {
> > >   int sum = 0;
> > >   Text dummy = new Text();
> > >   ArrayList IDs = new ArrayList();
> > >   String str;
> > >
> > >   while (values.hasNext()) {
> > >  dummy = values.next();
> > >  str = dummy.toString();
> > >  IDs.add(str);
> > >}
> > >DocIDs dc = new DocIDs();
> > >dc.setListdocs(IDs);
> > >   output.collect(key,dc);
> > > }
> > >   }
> > >
> > >  public int run(String[] args) throws Exception {
> > >   System.out.println("Run function is called");
> > > JobConf conf = new JobConf(getConf(), WordCount.class);
> > > conf.setJobName("wordcount");
> > >
> > > // the keys are words (strings)
> > > conf.setOutputKeyClass(Text.class);
> > >
> > > conf.setOutputValueClass(Text.class);
> > >
> > >
> > > conf.setMapperClass(MapClass.class);
> > >
> > > conf.setReducerClass(Reduce.class);
> > > }
> > >
> > >
> > > Now I am getting output array from the reducer as:-
> > > word \root\test\test123, \root\test12
> > >
> > > In the next stage I want to stop 'stop  words',  scrub words etc. and
> > like
> > > position of the word in the document. How would I apply multiple maps or
> > > multilevel map reduce jobs programmatically? I guess I need to make
> > another
> > > class or add some functions in it? I am not able to figure it out.
> > > Any pointers for these type of problems?
> > >
> > > Thanks,
> > > Aayush
> > >
> > >
> > > On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <[EMAIL PROTECTED]>
> > wrote:
> > >
> > >> On Wed, 26 Mar 2008, Aayush Garg wrote:
> > >>
> > >>> HI,
> > >>> I am developing the simple inverted index program frm the hadoop. My
> > map
> > >>> function has the output:
> > >>> 
> > >>> and the reducer has:
> > >>> 
> > >>>
> > >>> Now I want to use one more mapreduce to remove stop and scrub words
> > from
> > >> Use distributed cache as Arun mentioned.
> > >>> this output. Also in the next stage I would like to have short summay
> > >> Whether to use a separate MR job depends on what exactly you mean by
> > >> summary. If its like a window around the current word then you can
> > >> possibly do it in one go.
> > >> Amar
> > >>> associated with every word. How should I design my program from this
> > >> stage?
> > >>> I mean how would I apply multiple map

Re: Streaming + custom input format

2008-04-04 Thread Francesco Tamberi

I already tried that... nothing happened...
Thank you,
-- Francesco

Ted Dunning wrote:

I saw that, but I don't know if it will put a jar into the classpath at the
other end.


On 4/4/08 9:56 AM, "Yuri Pradkin" <[EMAIL PROTECTED]> wrote:

  

There is a -file option to streaming that
-file  File/dir to be shipped in the Job jar file

On Friday 04 April 2008 09:24:59 am Ted Dunning wrote:


At one point, it
was necessary to unpack the streaming.jar file and put your own classes and
jars into that.  Last time I looked at the code, however, there was support
for that happening magically, but in the 30 seconds I have allotted to help
you (sorry bout that), I can't see that there is a command line option to
trigger that, unless it is the one for including a file in the jar file.
  



  


Re: Streaming + custom input format

2008-04-04 Thread Francesco Tamberi

Thanks for your fast reply!

Ted Dunning wrote:

Take a look at the way that the text input format moves to the next line
after a split point.

  
I'm not sure I understand... is my way correct, or are you suggesting
another one?

There are a couple of possible problems with your input format not found
problem.

First, is your input in a package?  If so, you need to provide a complete
name for the class.

  
I forgot to explain, but I provided the complete class name (i.e.
org.myName.ClassName).

Secondly, you have to give streaming information about how to package up
your input format class for transfer to the cluster.  Having access to the
class on your initial invoking machine is not sufficient.  At one point, it
was necessary to unpack the streaming.jar file and put your own classes and
jars into that.  Last time I looked at the code, however, there was support
for that happening magically, but in the 30 seconds I have allotted to help
you (sorry bout that), I can't see that there is a command line option to
trigger that, unless it is the one for including a file in the jar file.

  
I'll try to include my jar/class in streaming.jar... it's not so clean, but
it would be great if it works! I'll keep you informed ;)

Thank you again

On 4/4/08 3:00 AM, "Francesco Tamberi" <[EMAIL PROTECTED]> wrote:

  

Hi All,
I have a streaming tool chain written in c++/python that performs some
operations on really big text files (gigabytes order); the chain reads files
and writes its result to standard output.
The chain needs to read well structured files and so I need to control how
hadoop splits files: it should splits a file only at suitable places.
What's the best way to do that?
I'm trying defining a custom input format in that way but I'm not sure it's
ok:

public class MyInputFormat extends FileInputFormat {
...

public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
...
}
}

That said, I tried to run that (on hadoop 0.15.3, 0.16.0, 0.16.1) with:

$HADOOP_HOME/bin/hadoop jar
$HADOOP_HOME/contrib/streaming/hadoop-0.16.1-streaming.jar -file ./allTools.sh
-mapper "allTools.sh" -jobconf mapred.reduce.tasks=0 -file pathToMyClass.class
-inputformat MyClass -input test.txt -output test-output

But it raises an exception "-inputformat : class not found : MyClass"
I tried passing a jar instead of class file, putting them in HADOOP_CLASSPATH,
putting in system' CLASSPATH but always the same result..

Thank you for your patience!
-- Francesco






  


Re: Hadoop: Multiple map reduce or some better way

2008-04-04 Thread Aayush Garg
No, currently my requirement is to solve this problem with Apache Hadoop. I am
trying to build up this type of inverted index and then measure its performance
against other approaches.

Thanks,


On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

>
> Are you implementing this for instruction or production?
>
> If production, why not use Lucene?
>
>
> On 4/3/08 6:45 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
>
> > HI  Amar , Theodore, Arun,
> >
> > Thanks for your reply. Actaully I am new to hadoop so cant figure out
> much.
> > I have written following code for inverted index. This code maps each
> word
> > from the document to its document id.
> > ex: apple file1 file123
> > Main functions of the code are:-
> >
> > public class HadoopProgram extends Configured implements Tool {
> > public static class MapClass extends MapReduceBase
> > implements Mapper {
> >
> > private final static IntWritable one = new IntWritable(1);
> > private Text word = new Text();
> > private Text doc = new Text();
> > private long numRecords=0;
> > private String inputFile;
> >
> >public void configure(JobConf job){
> > System.out.println("Configure function is called");
> > inputFile = job.get("map.input.file");
> > System.out.println("In conf the input file is"+inputFile);
> > }
> >
> >
> > public void map(LongWritable key, Text value,
> > OutputCollector output,
> > Reporter reporter) throws IOException {
> >   String line = value.toString();
> >   StringTokenizer itr = new StringTokenizer(line);
> >   doc.set(inputFile);
> >   while (itr.hasMoreTokens()) {
> > word.set(itr.nextToken());
> > output.collect(word,doc);
> >   }
> >   if(++numRecords%4==0){
> >System.out.println("Finished processing of input
> file"+inputFile);
> >  }
> > }
> >   }
> >
> >   /**
> >* A reducer class that just emits the sum of the input values.
> >*/
> >   public static class Reduce extends MapReduceBase
> > implements Reducer {
> >
> >   // This works as K2, V2, K3, V3
> > public void reduce(Text key, Iterator values,
> >OutputCollector output,
> >Reporter reporter) throws IOException {
> >   int sum = 0;
> >   Text dummy = new Text();
> >   ArrayList IDs = new ArrayList();
> >   String str;
> >
> >   while (values.hasNext()) {
> >  dummy = values.next();
> >  str = dummy.toString();
> >  IDs.add(str);
> >}
> >DocIDs dc = new DocIDs();
> >dc.setListdocs(IDs);
> >   output.collect(key,dc);
> > }
> >   }
> >
> >  public int run(String[] args) throws Exception {
> >   System.out.println("Run function is called");
> > JobConf conf = new JobConf(getConf(), WordCount.class);
> > conf.setJobName("wordcount");
> >
> > // the keys are words (strings)
> > conf.setOutputKeyClass(Text.class);
> >
> > conf.setOutputValueClass(Text.class);
> >
> >
> > conf.setMapperClass(MapClass.class);
> >
> > conf.setReducerClass(Reduce.class);
> > }
> >
> >
> > Now I am getting output array from the reducer as:-
> > word \root\test\test123, \root\test12
> >
> > In the next stage I want to stop 'stop  words',  scrub words etc. and
> like
> > position of the word in the document. How would I apply multiple maps or
> > multilevel map reduce jobs programmatically? I guess I need to make
> another
> > class or add some functions in it? I am not able to figure it out.
> > Any pointers for these type of problems?
> >
> > Thanks,
> > Aayush
> >
> >
> > On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <[EMAIL PROTECTED]>
> wrote:
> >
> >> On Wed, 26 Mar 2008, Aayush Garg wrote:
> >>
> >>> HI,
> >>> I am developing the simple inverted index program frm the hadoop. My
> map
> >>> function has the output:
> >>> 
> >>> and the reducer has:
> >>> 
> >>>
> >>> Now I want to use one more mapreduce to remove stop and scrub words
> from
> >> Use distributed cache as Arun mentioned.
> >>> this output. Also in the next stage I would like to have short summay
> >> Whether to use a separate MR job depends on what exactly you mean by
> >> summary. If its like a window around the current word then you can
> >> possibly do it in one go.
> >> Amar
> >>> associated with every word. How should I design my program from this
> >> stage?
> >>> I mean how would I apply multiple mapreduce to this? What would be the
> >>> better way to perform this?
> >>>
> >>> Thanks,
> >>>
> >>> Regards,
> >>> -
> >>>
> >>>
> >>
>
>


-- 
Aayush Garg,
Phone: +41 76 482 240
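
On the "how would I apply multiple maps or multilevel map reduce jobs
programmatically" part of this thread, one common pattern is simply to configure two
JobConfs in the driver and run them back to back, with the first job's output
directory as the second job's input. A minimal sketch (the driver class is invented,
IdentityMapper/IdentityReducer stand in for your own MapClass and Reduce, and the
path-setting helpers are spelled as in somewhat newer releases of the old mapred API,
so adjust them to your version):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Sketch of chaining two MapReduce jobs: stage 2 reads whatever stage 1 wrote.
public class TwoStageDriver {
  public static void main(String[] args) throws Exception {
    Path input = new Path(args[0]);
    Path intermediate = new Path(args[1]);   // stage 1 output, stage 2 input
    Path output = new Path(args[2]);

    JobConf stage1 = new JobConf(TwoStageDriver.class);
    stage1.setJobName("build-inverted-index");
    stage1.setMapperClass(IdentityMapper.class);      // your MapClass here
    stage1.setReducerClass(IdentityReducer.class);    // your Reduce here
    stage1.setOutputKeyClass(Text.class);
    stage1.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(stage1, input);
    FileOutputFormat.setOutputPath(stage1, intermediate);
    JobClient.runJob(stage1);                // blocks until stage 1 completes

    JobConf stage2 = new JobConf(TwoStageDriver.class);
    stage2.setJobName("scrub-stopwords");
    stage2.setMapperClass(IdentityMapper.class);      // second-pass mapper here
    stage2.setReducerClass(IdentityReducer.class);
    stage2.setOutputKeyClass(Text.class);
    stage2.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(stage2, intermediate);
    FileOutputFormat.setOutputPath(stage2, output);
    JobClient.runJob(stage2);                // stage 2 starts only after stage 1
  }
}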


Re: Streaming + custom input format

2008-04-04 Thread Ted Dunning

I saw that, but I don't know if it will put a jar into the classpath at the
other end.


On 4/4/08 9:56 AM, "Yuri Pradkin" <[EMAIL PROTECTED]> wrote:

> There is a -file option to streaming that
> -file  File/dir to be shipped in the Job jar file
> 
> On Friday 04 April 2008 09:24:59 am Ted Dunning wrote:
>> At one point, it
>> was necessary to unpack the streaming.jar file and put your own classes and
>> jars into that.  Last time I looked at the code, however, there was support
>> for that happening magically, but in the 30 seconds I have allotted to help
>> you (sorry bout that), I can't see that there is a command line option to
>> trigger that, unless it is the one for including a file in the jar file.
> 
> 



Re: Streaming + custom input format

2008-04-04 Thread Yuri Pradkin
There is a -file option to streaming that
-file  File/dir to be shipped in the Job jar file

On Friday 04 April 2008 09:24:59 am Ted Dunning wrote:
> At one point, it
> was necessary to unpack the streaming.jar file and put your own classes and
> jars into that.  Last time I looked at the code, however, there was support
> for that happening magically, but in the 30 seconds I have allotted to help
> you (sorry bout that), I can't see that there is a command line option to
> trigger that, unless it is the one for including a file in the jar file.




Re: Is it possible in Hadoop to overwrite or update a file?

2008-04-04 Thread Ted Dunning

My suggestion actually is similar to what bigtable and hbase do.

That is to keep some recent updates in memory, burping them to disk at
relatively frequent intervals.  Then when a number of burps are available,
they can be merged to a larger burp.  This pyramid can be extended as
needed.

Searches then would have to probe the files at each level to find the latest
version of the record.

File append or a highly reliable in-memory storage for the current partial
burp (something like zookeeper or replicated memcache?) is required for this
approach to not lose recent data on machine failure, but it has the
potential for high write rates while maintaining reasonable read rates.  Of
course, truly random reads will kill you, but reads biased toward recently
updated records should have awesome performance.
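
A toy sketch of the read path this implies, in plain Java with nothing Hadoop-specific
(flushing, merging and deletes are left out, and Segment stands in for whatever sorted
on-disk format the burps are written in, e.g. a MapFile):

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.TreeMap;

// Recent updates live in memory; older ones live in immutable on-disk "burps".
// Reads probe newest-first so the latest version of a record wins.
interface Segment {
  String get(String key) throws IOException;   // null if the key is absent
}

class LeveledStore {
  private final TreeMap<String, String> memBuffer = new TreeMap<String, String>();
  private final Deque<Segment> segments = new ArrayDeque<Segment>(); // newest first

  public void put(String key, String value) {
    memBuffer.put(key, value);       // in a real store: burp to disk when large
  }

  public String get(String key) throws IOException {
    String v = memBuffer.get(key);   // most recent updates first
    if (v != null) {
      return v;
    }
    for (Segment s : segments) {     // then newer burps before older ones
      v = s.get(key);
      if (v != null) {
        return v;
      }
    }
    return null;                     // key was never written
  }
}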


On 4/4/08 5:04 AM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:

> Ted Dunning wrote:
> 
>> This factor of 1500 in speed seems pretty significant and is the motivation
>> for not supporting random read/write.
>> 
>> This doesn't mean that random access update should never be done, but it
>> does mean that scaling a design based around random access will be more
>> difficult than scaling a design based on sequential read and write.
>> 
>> On 4/3/08 12:07 PM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:
>> 
>>> In general, if updates are relatively frequent and small compared to the
>>> size of data then this could be useful.
>> 
>> 
> 
> Hehe ... yes, good calculations :) What I had in mind though when saying
> "relatively frequent" was rather a situation when updates are usually
> small and come at unpredictable intervals (e.g. picked from a queue
> listener) and then need to set flags on a few records. Running
> sequential update in face of such minor changes doesn't usually pay off,
> and queueing the changes so that it starts to pay off is sometimes not
> possible (takes too long to fill the batch).
> 



Re: Streaming + custom input format

2008-04-04 Thread Ted Dunning

Take a look at the way that the text input format moves to the next line
after a split point.
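
If you would rather not reimplement that boundary-skipping logic, the simplest
workaround is to make your format non-splittable, so each file is handled whole
by a single map task.  A minimal, untested sketch against the 0.16-era mapred
API (the class name is just an example):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Never split input files; record boundaries are then respected automatically,
// at the cost of losing parallelism within any single large file.
public class NonSplittableTextInputFormat extends TextInputFormat {
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }
}

Each file then becomes exactly one map task, which is fine as long as no
individual file is enormous.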

There are a couple of possible causes for your "input format not found"
problem.

First, is your input format class in a package?  If so, you need to provide
its fully qualified name.

Secondly, you have to give streaming information about how to package up
your input format class for transfer to the cluster.  Having access to the
class on your initial invoking machine is not sufficient.  At one point, it
was necessary to unpack the streaming.jar file and put your own classes and
jars into that.  Last time I looked at the code, there was support for that
happening magically.  But in the 30 seconds I have allotted to help you (sorry
bout that), I can't see a command line option to trigger it, unless it is the
one for including a file in the jar file.
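
For what it's worth, the unpack-and-repackage route above can also be done
without unpacking, by adding the class to a copy of the streaming jar and
submitting that copy instead.  Untested, and assuming MyInputFormat.class is in
the default package (otherwise add it under its package directory path):

cp $HADOOP_HOME/contrib/streaming/hadoop-0.16.1-streaming.jar my-streaming.jar
jar uf my-streaming.jar MyInputFormat.class
$HADOOP_HOME/bin/hadoop jar my-streaming.jar -inputformat MyInputFormat \
    -mapper "allTools.sh" -file ./allTools.sh -input test.txt -output test-output

Since hadoop jar puts that jar on the classpath, the class should at least be
visible on the submitting side.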


On 4/4/08 3:00 AM, "Francesco Tamberi" <[EMAIL PROTECTED]> wrote:

> Hi All,
> I have a streaming tool chain written in c++/python that performs some
> operations on really big text files (gigabytes order); the chain reads files
> and writes its result to standard output.
> The chain needs to read well structured files and so I need to control how
> hadoop splits files: it should splits a file only at suitable places.
> What's the best way to do that?
> I'm trying defining a custom input format in that way but I'm not sure it's
> ok:
> 
> public class MyInputFormat extends FileInputFormat {
> ...
> 
> public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
> ...
> }
> }
> 
> That said, I tried to run that (on hadoop 0.15.3, 0.16.0, 0.16.1) with:
> 
> $HADOOP_HOME/bin/hadoop jar
> $HADOOP_HOME/contrib/streaming/hadoop-0.16.1-streaming.jar -file ./allTools.sh
> -mapper "allTools.sh" -jobconf mapred.reduce.tasks=0 -file pathToMyClass.class
> -inputformat MyClass -input test.txt -output test-output
> 
> But it raises an exception "-inputformat : class not found : MyClass"
> I tried passing a jar instead of class file, putting them in HADOOP_CLASSPATH,
> putting in system' CLASSPATH but always the same result..
> 
> Thank you for your patience!
> -- Francesco
> 
> 
> 



Re: Newbie and basic questions

2008-04-04 Thread Travis Brady
Hi Alberto,

Here's my take as someone from the traditional RDBMS world who has been
experimenting with Hadoop for a month, so don't take my comments to be
definitive.

On Fri, Apr 4, 2008 at 7:57 AM, Alberto Mesas <[EMAIL PROTECTED]>
wrote:

> We have been reading some doc, and playing with the basic samples that
> come
> with Hadoop 0.16.2. So let's see if we have understood everything :)
>
> We plan to use HCore for processing our logs, but would it be possible to
> use it for a case like this one?
>
> MySQL Table with a few thousands of new rows every hour.
> Every unique row must send by HTTP/GET to remote servers.
> Update the row with the HTTP/GET result.
> If HTTP/GET fails, retry again later this row.
>
> We can't do twice the same unique row, so we must be sure that each is
> sent
> once and only once, well at least until we get a valid response.


This really sounds like an OLTP-type system; your use of the word UPDATE
pegs it as a pretty good candidate for a traditional row-store RDBMS.
Hadoop is really directed at read-intensive stuff.  A good example would be
analyzing log files and generating traffic statistics based on their content
in large batch jobs.  There is no question that you could build a system to
solve your problem with Hadoop, but it would take more work than doing it
with Oracle/Postgres/whatever.  You might have a look at HBase though:
http://wiki.apache.org/hadoop/Hbase

I think of Hadoop as a very general tool that can be used to spread large
tasks across a cluster of machines.  I think this fact is somewhat obscured by
the comparisons often made between Hadoop/MapReduce and relational
databases.

>
>
> Right now  we are doing it using a simple java app that selects N rows
> from
> the table, and uses N threads to do N HTTP/GETs.

Honestly if you have a simple app doing this for you I'd say stick with
that.


> If it's possible to implement it in Hadoop, we get easy ScaleOut and High
> Availability.
>
> ScalingUp is becoming  a little prob, and we have no real HA.
>
> I have doubts like, how to "select unique rows" and give them to Hadoop,
> etc...
>
> Any advice is welcome :)
>
hope that helps


-- 
Travis Brady
www.mochiads.com


Re: If I wanna read a config file before map task, which class I should choose?

2008-04-04 Thread Ted Dunning

Just write a parser and put it into the configure method.
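
A rough, untested sketch of what that could look like for the format quoted
below (the property name "myjob.conf.path" and the class name are made up for
illustration):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ConfiguredMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // parsed config entries, keyed by "tag:name"
  private final Map<String, String[]> params = new HashMap<String, String[]>();

  public void configure(JobConf job) {
    try {
      FileSystem fs = FileSystem.get(job);
      Path conf = new Path(job.get("myjob.conf.path"));  // set by the job submitter
      BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(conf)));
      // each line looks like: @tag_name0 name0 {value00, value01, value02}
      Pattern pat = Pattern.compile("@(\\S+)\\s+(\\S+)\\s*\\{([^}]*)\\}");
      String line;
      while ((line = in.readLine()) != null) {
        Matcher m = pat.matcher(line.trim());
        if (m.matches()) {
          params.put(m.group(1) + ":" + m.group(2), m.group(3).split("\\s*,\\s*"));
        }
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("failed to read config file", e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // ... use params here ...
  }
}

You would set "myjob.conf.path" on the JobConf when submitting the job.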


On 4/3/08 8:31 PM, "Jeremy Chow" <[EMAIL PROTECTED]> wrote:

> thanks, the configure file format looks like below,
> 
> @tag_name0 name0 {value00, value01, value02}
> @tag_name1 name1 {value10, value11, value12}
> 
> and reading it from HDFS. Then how can I parse them ?



Re: Hadoop: Multiple map reduce or some better way

2008-04-04 Thread Ted Dunning

Are you implementing this for instruction or production?

If production, why not use Lucene?


On 4/3/08 6:45 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:

> HI  Amar , Theodore, Arun,
> 
> Thanks for your reply. Actaully I am new to hadoop so cant figure out much.
> I have written following code for inverted index. This code maps each word
> from the document to its document id.
> ex: apple file1 file123
> Main functions of the code are:-
> 
> public class HadoopProgram extends Configured implements Tool {
> public static class MapClass extends MapReduceBase
> implements Mapper<LongWritable, Text, Text, Text> {
> 
> private final static IntWritable one = new IntWritable(1);
> private Text word = new Text();
> private Text doc = new Text();
> private long numRecords=0;
> private String inputFile;
> 
>public void configure(JobConf job){
> System.out.println("Configure function is called");
> inputFile = job.get("map.input.file");
> System.out.println("In conf the input file is"+inputFile);
> }
> 
> 
> public void map(LongWritable key, Text value,
> OutputCollector<Text, Text> output,
> Reporter reporter) throws IOException {
>   String line = value.toString();
>   StringTokenizer itr = new StringTokenizer(line);
>   doc.set(inputFile);
>   while (itr.hasMoreTokens()) {
> word.set(itr.nextToken());
> output.collect(word,doc);
>   }
>   if(++numRecords%4==0){
>System.out.println("Finished processing of input file"+inputFile);
>  }
> }
>   }
> 
>   /**
>* A reducer class that just emits the sum of the input values.
>*/
>   public static class Reduce extends MapReduceBase
> implements Reducer<Text, Text, Text, DocIDs> {
> 
>   // This works as K2, V2, K3, V3
> public void reduce(Text key, Iterator<Text> values,
>OutputCollector<Text, DocIDs> output,
>Reporter reporter) throws IOException {
>   int sum = 0;
>   Text dummy = new Text();
>   ArrayList<String> IDs = new ArrayList<String>();
>   String str;
> 
>   while (values.hasNext()) {
>  dummy = values.next();
>  str = dummy.toString();
>  IDs.add(str);
>}
>DocIDs dc = new DocIDs();
>dc.setListdocs(IDs);
>   output.collect(key,dc);
> }
>   }
> 
>  public int run(String[] args) throws Exception {
>   System.out.println("Run function is called");
> JobConf conf = new JobConf(getConf(), WordCount.class);
> conf.setJobName("wordcount");
> 
> // the keys are words (strings)
> conf.setOutputKeyClass(Text.class);
> 
> conf.setOutputValueClass(Text.class);
> 
> 
> conf.setMapperClass(MapClass.class);
> 
> conf.setReducerClass(Reduce.class);
> }
> 
> 
> Now I am getting output array from the reducer as:-
> word \root\test\test123, \root\test12
> 
> In the next stage I want to stop 'stop  words',  scrub words etc. and like
> position of the word in the document. How would I apply multiple maps or
> multilevel map reduce jobs programmatically? I guess I need to make another
> class or add some functions in it? I am not able to figure it out.
> Any pointers for these type of problems?
> 
> Thanks,
> Aayush
> 
> 
> On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <[EMAIL PROTECTED]> wrote:
> 
>> On Wed, 26 Mar 2008, Aayush Garg wrote:
>> 
>>> HI,
>>> I am developing the simple inverted index program frm the hadoop. My map
>>> function has the output:
>>> 
>>> and the reducer has:
>>> 
>>> 
>>> Now I want to use one more mapreduce to remove stop and scrub words from
>> Use distributed cache as Arun mentioned.
>>> this output. Also in the next stage I would like to have short summay
>> Whether to use a separate MR job depends on what exactly you mean by
>> summary. If its like a window around the current word then you can
>> possibly do it in one go.
>> Amar
>>> associated with every word. How should I design my program from this
>> stage?
>>> I mean how would I apply multiple mapreduce to this? What would be the
>>> better way to perform this?
>>> 
>>> Thanks,
>>> 
>>> Regards,
>>> -
>>> 
>>> 
>> 



Newbie and basic questions

2008-04-04 Thread Alberto Mesas
We have been reading some of the docs and playing with the basic samples that
come with Hadoop 0.16.2. So let's see if we have understood everything :)

We plan to use HCore for processing our logs, but would it be possible to
use it for a case like this one?

MySQL Table with a few thousands of new rows every hour.
Every unique row must be sent by HTTP/GET to remote servers.
Update the row with the HTTP/GET result.
If the HTTP/GET fails, retry this row again later.

We can't process the same unique row twice, so we must be sure that each is sent
once and only once; well, at least until we get a valid response.


Right now we are doing it with a simple Java app that selects N rows from
the table and uses N threads to do N HTTP/GETs.

If it's possible to implement it in Hadoop, we get easy ScaleOut and High
Availability.

Scaling up is becoming a bit of a problem, and we have no real HA.

I have doubts like, how to "select unique rows" and give them to Hadoop,
etc...

Any advice is welcome :)


Re: Is it possible in Hadoop to overwrite or update a file?

2008-04-04 Thread Andrzej Bialecki

Ted Dunning wrote:


This factor of 1500 in speed seems pretty significant and is the motivation
for not supporting random read/write.

This doesn't mean that random access update should never be done, but it
does mean that scaling a design based around random access will be more
difficult than scaling a design based on sequential read and write.

On 4/3/08 12:07 PM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:


In general, if updates are relatively frequent and small compared to the
size of data then this could be useful.





Hehe ... yes, good calculations :) What I had in mind, though, when saying 
"relatively frequent" was rather a situation where updates are usually 
small and come at unpredictable intervals (e.g. picked from a queue 
listener) and then need to set flags on a few records. Running a 
sequential update in the face of such minor changes doesn't usually pay off, 
and queueing the changes until it does pay off is sometimes not 
possible (it takes too long to fill the batch).



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: distcp fails when copying from s3 to hdfs

2008-04-04 Thread Siddhartha Reddy
Thanks for the quick response, Tom.
I have just switched to Hadoop 0.16.2 and tried this again. Now I am getting
the following error:

Copy failed: org.apache.hadoop.mapred.InvalidInputException: Input source
s3://id:[EMAIL PROTECTED]/file.txt does not exist.


I copied the file to S3 using the following command:

bin/hadoop distcp file.txt s3://id:[EMAIL PROTECTED]/file.txt


To check that the file actually exists on S3, I tried the following
commands:

bin/hadoop fs -fs s3://id:[EMAIL PROTECTED] -ls
bin/hadoop fs -fs s3://id:[EMAIL PROTECTED] -ls

The first returned nothing, while the second returned the following:

Found 1 items
/_distcp_logs_5vzva5   1969-12-31 19:00   rwxrwxrwx


And when I tried to copy it back to hdfs using the following command:

bin/hadoop distcp s3://id:[EMAIL PROTECTED]/file.txt file2.txt


I got this error:

Copy failed: org.apache.hadoop.mapred.InvalidInputException: Input source
s3://id:[EMAIL PROTECTED]/file.txt does not exist.
at org.apache.hadoop.util.CopyFiles.checkSrcPath(CopyFiles.java:504)
at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:520)
at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:596)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:612)

Any pointers on why this could be happening?

Thanks,
Siddhartha

On Fri, Apr 4, 2008 at 2:13 PM, Tom White <[EMAIL PROTECTED]> wrote:

> Hi Siddhartha,
>
> This is a problem in 0.16.1
> (https://issues.apache.org/jira/browse/HADOOP-3027) that is fixed in
> 0.16.2, which was released yesterday.
>
> Tom
>
> On 04/04/2008, Siddhartha Reddy <[EMAIL PROTECTED]> wrote:
> > I am trying to run a Hadoop cluster on Amazon EC2 and backup all the
> data on
> >  Amazon S3 between the runs. I am using Hadoop 0.16.1 on a cluster made
> up of
> >  CentOS 5 images (ami-08f41161).
> >
> >
> >  I am able to copy from hdfs to S3 using the following command:
> >
> >  bin/hadoop distcp file.txt s3://id:[EMAIL PROTECTED]/file.txt
> >
> >
> >  But copying from S3 to hdfs with the following command fails:
> >
> >  bin/hadoop distcp s3://id:[EMAIL PROTECTED]/file.txt file2.txt
> >
> >
> >  with the following error:
> >
> >  With failures, global counters are inaccurate; consider running with -i
> >  Copy failed: java.lang.IllegalArgumentException: Hook previously
> registered
> > at
> >
>  java.lang.ApplicationShutdownHooks.add(ApplicationShutdownHooks.java:45)
> > at java.lang.Runtime.addShutdownHook(Runtime.java:192)
> > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1194)
> > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:148)
> > at
> org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:81)
> > at
> >  org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1180)
> > at org.apache.hadoop.fs.FileSystem.access$400(FileSystem.java:53)
> > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1197)
> > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:148)
> > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
> > at org.apache.hadoop.util.CopyFiles.checkSrcPath(CopyFiles.java:482)
> > at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:504)
> > at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:580)
> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> > at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:596)
> >
> >
> >  Can someone please point out if and what I am doing wrong?
> >
> >  Thanks,
> >
> > Siddhartha Reddy
> >
>



-- 
http://sids.in
"If you are not having fun, you are not doing it right."


Streaming + custom input format

2008-04-04 Thread Francesco Tamberi

Hi All,
I have a streaming tool chain written in C++/Python that performs some 
operations on really big text files (on the order of gigabytes); the chain 
reads files and writes its result to standard output.
The chain needs to read well-structured files, so I need to control how 
Hadoop splits files: it should split a file only at suitable places.
What's the best way to do that?
I'm trying to define a custom input format for this, but I'm not sure it's ok:

public class MyInputFormat extends FileInputFormat {
...

public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
...
}
}

That said, I tried to run that (on hadoop 0.15.3, 0.16.0, 0.16.1) with:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.16.1-streaming.jar 
-file ./allTools.sh -mapper "allTools.sh" -jobconf mapred.reduce.tasks=0 -file 
pathToMyClass.class -inputformat MyClass -input test.txt -output test-output

But it raises an exception "-inputformat : class not found : MyClass"
I tried passing a jar instead of the class file, putting them in HADOOP_CLASSPATH, 
and putting them in the system CLASSPATH, but I always get the same result.

Thank you for your patience!
-- Francesco





Re: distcp fails when copying from s3 to hdfs

2008-04-04 Thread Tom White
Hi Siddhartha,

This is a problem in 0.16.1
(https://issues.apache.org/jira/browse/HADOOP-3027) that is fixed in
0.16.2, which was released yesterday.

Tom

On 04/04/2008, Siddhartha Reddy <[EMAIL PROTECTED]> wrote:
> I am trying to run a Hadoop cluster on Amazon EC2 and backup all the data on
>  Amazon S3 between the runs. I am using Hadoop 0.16.1 on a cluster made up of
>  CentOS 5 images (ami-08f41161).
>
>
>  I am able to copy from hdfs to S3 using the following command:
>
>  bin/hadoop distcp file.txt s3://id:[EMAIL PROTECTED]/file.txt
>
>
>  But copying from S3 to hdfs with the following command fails:
>
>  bin/hadoop distcp s3://id:[EMAIL PROTECTED]/file.txt file2.txt
>
>
>  with the following error:
>
>  With failures, global counters are inaccurate; consider running with -i
>  Copy failed: java.lang.IllegalArgumentException: Hook previously registered
> at
>  java.lang.ApplicationShutdownHooks.add(ApplicationShutdownHooks.java:45)
> at java.lang.Runtime.addShutdownHook(Runtime.java:192)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1194)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:148)
> at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:81)
> at
>  org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1180)
> at org.apache.hadoop.fs.FileSystem.access$400(FileSystem.java:53)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1197)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:148)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
> at org.apache.hadoop.util.CopyFiles.checkSrcPath(CopyFiles.java:482)
> at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:504)
> at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:580)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:596)
>
>
>  Can someone please point out if and what I am doing wrong?
>
>  Thanks,
>
> Siddhartha Reddy
>


Re: distcp fails :Input source not found

2008-04-04 Thread Tom White
 >   However, when I try it on 0.15.3, it doesn't allow a folder copy.
I have 100+ files in my S3 bucket, and I had to run "distcp" on each
one of them to get them on HDFS on EC2 . Not a nice experience!

This sounds like a bug - could you log a Jira issue for this please?

Thanks,
Tom


distcp fails when copying from s3 to hdfs

2008-04-04 Thread Siddhartha Reddy
I am trying to run a Hadoop cluster on Amazon EC2 and back up all the data to
Amazon S3 between runs. I am using Hadoop 0.16.1 on a cluster made up of
CentOS 5 images (ami-08f41161).


I am able to copy from hdfs to S3 using the following command:

bin/hadoop distcp file.txt s3://id:[EMAIL PROTECTED]/file.txt


But copying from S3 to hdfs with the following command fails:

bin/hadoop distcp s3://id:[EMAIL PROTECTED]/file.txt file2.txt


with the following error:

With failures, global counters are inaccurate; consider running with -i
Copy failed: java.lang.IllegalArgumentException: Hook previously registered
at
java.lang.ApplicationShutdownHooks.add(ApplicationShutdownHooks.java:45)
at java.lang.Runtime.addShutdownHook(Runtime.java:192)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1194)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:148)
at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:81)
at
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1180)
at org.apache.hadoop.fs.FileSystem.access$400(FileSystem.java:53)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1197)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:148)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
at org.apache.hadoop.util.CopyFiles.checkSrcPath(CopyFiles.java:482)
at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:504)
at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:580)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:596)


Can someone please point out if and what I am doing wrong?

Thanks,
Siddhartha Reddy