Re: Hadoop cluster build, machine specs
Greetings, It really depends on your budget. What are you looking to spend? $5k? $20k? Hadoop is about bringing the calculations to your data, so the more machines you can have, the better. In general, I'd recommend Dual-Core Opterons and 2-4 GB of RAM with an SATA hard drive. My company just ordered five such machines from Dell for Hadoop goodness, and I think the total came to around eight grand. Another alternative is Amazon EC2 and S3, of course. It all depends on what you want to do. On Fri, Apr 4, 2008 at 5:27 PM, Ted Dziuba <[EMAIL PROTECTED]> wrote: > Hi all, > > I'm looking to build a small, 5-10 node cluster to run mostly CPU-bound > Hadoop jobs. I'm shying away from the 8-core behemoth type machines for > cost reasons. But what about dual core machines? 32 or 64 bits? > > I'm still in the planning stages, so any advice would be greatly > appreciated. > > Thanks, > > Ted >
Hadoop cluster build, machine specs
Hi all, I'm looking to build a small, 5-10 node cluster to run mostly CPU-bound Hadoop jobs. I'm shying away from the 8-core behemoth type machines for cost reasons. But what about dual core machines? 32 or 64 bits? I'm still in the planning stages, so any advice would be greatly appreciated. Thanks, Ted
Re: Hadoop: Multiple map reduce or some better way
Please share your input on my problem. Thanks, On Sat, Apr 5, 2008 at 1:10 AM, Robert Dempsey <[EMAIL PROTECTED]> wrote: > Ted, > > It appears that Nutch hasn't been updated in a while (in Internet time at least). Do you know if it works with the latest versions of Hadoop? Thanks. > > - Robert Dempsey (new to the list) > > > On Apr 4, 2008, at 5:36 PM, Ted Dunning wrote: > > > > > > > See Nutch. See Nutch run. > > > > http://en.wikipedia.org/wiki/Nutch > > http://lucene.apache.org/nutch/ > > > -- Aayush Garg, Phone: +41 76 482 240
Re: Hadoop: Multiple map reduce or some better way
Ted, It appears that Nutch hasn't been updated in a while (in Internet time at least). Do you know if it works with the latest versions of Hadoop? Thanks. - Robert Dempsey (new to the list) On Apr 4, 2008, at 5:36 PM, Ted Dunning wrote: See Nutch. See Nutch run. http://en.wikipedia.org/wiki/Nutch http://lucene.apache.org/nutch/
RE: secondary namenode web interface
Your configuration is good. The secondary Namenode does not publish a web interface. The "null pointer" message in the secondary Namenode log is a harmless bug but should be fixed. It would be nice if you can open a JIRA for it. Thanks, Dhruba -Original Message- From: Yuri Pradkin [mailto:[EMAIL PROTECTED] Sent: Friday, April 04, 2008 2:45 PM To: core-user@hadoop.apache.org Subject: Re: secondary namenode web interface I'm re-posting this in hope that someone would help. Thanks! On Wednesday 02 April 2008 01:29:45 pm Yuri Pradkin wrote: > Hi, > > I'm running Hadoop (latest snapshot) on several machines and in our setup > namenode and secondarynamenode are on different systems. I see from the > logs than secondary namenode regularly checkpoints fs from primary > namenode. > > But when I go to the secondary namenode HTTP (dfs.secondary.http.address) > in my browser I see something like this: > > HTTP ERROR: 500 > init > RequestURI=/dfshealth.jsp > Powered by Jetty:// > > And in secondary's log I find these lines: > > 2008-04-02 11:26:25,357 WARN /: /dfshealth.jsp: > java.lang.NullPointerException > at > org.apache.hadoop.dfs.dfshealth_jsp.(dfshealth_jsp.java:21) at > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorA cce >ssorImpl.java:57) at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingCons tru >ctorAccessorImpl.java:45) at > java.lang.reflect.Constructor.newInstance(Constructor.java:539) at > java.lang.Class.newInstance0(Class.java:373) > at java.lang.Class.newInstance(Class.java:326) > at org.mortbay.jetty.servlet.Holder.newInstance(Holder.java:199) > at > org.mortbay.jetty.servlet.ServletHolder.getServlet(ServletHolder.java:32 6) > at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:405) > at > org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationH and >ler.java:475) at > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567) at > org.mortbay.http.HttpContext.handle(HttpContext.java:1565) at > org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationCon tex >t.java:635) at org.mortbay.http.HttpContext.handle(HttpContext.java:1517) at > org.mortbay.http.HttpServer.service(HttpServer.java:954) at > org.mortbay.http.HttpConnection.service(HttpConnection.java:814) at > org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981) at > org.mortbay.http.HttpConnection.handle(HttpConnection.java:831) at > org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244 ) > at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357) at > org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534) > > Is something missing from my configuration? Anybody else seen these? > > Thanks, > > -Yuri
Re: secondary namenode web interface
I'm re-posting this in hope that someone would help. Thanks! On Wednesday 02 April 2008 01:29:45 pm Yuri Pradkin wrote: > Hi, > > I'm running Hadoop (latest snapshot) on several machines and in our setup > namenode and secondarynamenode are on different systems. I see from the > logs than secondary namenode regularly checkpoints fs from primary > namenode. > > But when I go to the secondary namenode HTTP (dfs.secondary.http.address) > in my browser I see something like this: > > HTTP ERROR: 500 > init > RequestURI=/dfshealth.jsp > Powered by Jetty:// > > And in secondary's log I find these lines: > > 2008-04-02 11:26:25,357 WARN /: /dfshealth.jsp: > java.lang.NullPointerException > at > org.apache.hadoop.dfs.dfshealth_jsp.(dfshealth_jsp.java:21) at > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAcce >ssorImpl.java:57) at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstru >ctorAccessorImpl.java:45) at > java.lang.reflect.Constructor.newInstance(Constructor.java:539) at > java.lang.Class.newInstance0(Class.java:373) > at java.lang.Class.newInstance(Class.java:326) > at org.mortbay.jetty.servlet.Holder.newInstance(Holder.java:199) > at > org.mortbay.jetty.servlet.ServletHolder.getServlet(ServletHolder.java:326) > at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:405) > at > org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHand >ler.java:475) at > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567) at > org.mortbay.http.HttpContext.handle(HttpContext.java:1565) at > org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContex >t.java:635) at org.mortbay.http.HttpContext.handle(HttpContext.java:1517) at > org.mortbay.http.HttpServer.service(HttpServer.java:954) at > org.mortbay.http.HttpConnection.service(HttpConnection.java:814) at > org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981) at > org.mortbay.http.HttpConnection.handle(HttpConnection.java:831) at > org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244) > at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357) at > org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534) > > Is something missing from my configuration? Anybody else seen these? > > Thanks, > > -Yuri
Re: Streaming + custom input format
It does work for me. I have to BOTH ship the extra jar using -file AND include it in the classpath on the local system (via setting HADOOP_CLASSPATH). I'm not sure what "nothing happened" means. BTW, I'm using the 0.16.2 release. On Friday 04 April 2008 10:19:54 am Francesco Tamberi wrote: > I already tried that... nothing happened... > Thank you, > -- Francesco > > Ted Dunning wrote: > > I saw that, but I don't know if it will put a jar into the classpath at > > the other end.
Re: Hadoop: Multiple map reduce or some better way
See Nutch. See Nutch run. http://en.wikipedia.org/wiki/Nutch http://lucene.apache.org/nutch/ On 4/4/08 1:22 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote: > Hi, > > I have not used lucene index ever before. I do not get how we build it with > hadoop Map reduce. Basically what I was looking for like how to implement > multilevel map/reduce for my mentioned problem. > > > On Fri, Apr 4, 2008 at 7:23 PM, Ning Li <[EMAIL PROTECTED]> wrote: > >> You can build Lucene indexes using Hadoop Map/Reduce. See the index >> contrib package in the trunk. Or is it still not something you are >> looking for? >> >> Regards, >> Ning >> >> On 4/4/08, Aayush Garg <[EMAIL PROTECTED]> wrote: >>> No, currently my requirement is to solve this problem by apache hadoop. >> I am >>> trying to build up this type of inverted index and then measure >> performance >>> criteria with respect to others. >>> >>> Thanks, >>> >>> >>> On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: >>> Are you implementing this for instruction or production? If production, why not use Lucene? On 4/3/08 6:45 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote: > HI Amar , Theodore, Arun, > > Thanks for your reply. Actaully I am new to hadoop so cant figure >> out much. > I have written following code for inverted index. This code maps >> each word > from the document to its document id. > ex: apple file1 file123 > Main functions of the code are:- > > public class HadoopProgram extends Configured implements Tool { > public static class MapClass extends MapReduceBase > implements Mapper { > > private final static IntWritable one = new IntWritable(1); > private Text word = new Text(); > private Text doc = new Text(); > private long numRecords=0; > private String inputFile; > >public void configure(JobConf job){ > System.out.println("Configure function is called"); > inputFile = job.get("map.input.file"); > System.out.println("In conf the input file is"+inputFile); > } > > > public void map(LongWritable key, Text value, > OutputCollector output, > Reporter reporter) throws IOException { > String line = value.toString(); > StringTokenizer itr = new StringTokenizer(line); > doc.set(inputFile); > while (itr.hasMoreTokens()) { > word.set(itr.nextToken()); > output.collect(word,doc); > } > if(++numRecords%4==0){ >System.out.println("Finished processing of input file"+inputFile); > } > } > } > > /** >* A reducer class that just emits the sum of the input values. >*/ > public static class Reduce extends MapReduceBase > implements Reducer { > > // This works as K2, V2, K3, V3 > public void reduce(Text key, Iterator values, >OutputCollector output, >Reporter reporter) throws IOException { > int sum = 0; > Text dummy = new Text(); > ArrayList IDs = new ArrayList(); > String str; > > while (values.hasNext()) { > dummy = values.next(); > str = dummy.toString(); > IDs.add(str); >} >DocIDs dc = new DocIDs(); >dc.setListdocs(IDs); > output.collect(key,dc); > } > } > > public int run(String[] args) throws Exception { > System.out.println("Run function is called"); > JobConf conf = new JobConf(getConf(), WordCount.class); > conf.setJobName("wordcount"); > > // the keys are words (strings) > conf.setOutputKeyClass(Text.class); > > conf.setOutputValueClass(Text.class); > > > conf.setMapperClass(MapClass.class); > > conf.setReducerClass(Reduce.class); > } > > > Now I am getting output array from the reducer as:- > word \root\test\test123, \root\test12 > > In the next stage I want to stop 'stop words', scrub words etc. 
>> and like > position of the word in the document. How would I apply multiple >> maps or > multilevel map reduce jobs programmatically? I guess I need to make another > class or add some functions in it? I am not able to figure it out. > Any pointers for these type of problems? > > Thanks, > Aayush > > > On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <[EMAIL PROTECTED]> wrote: > >> On Wed, 26 Mar 2008, Aayush Garg wrote: >> >>> HI, >>> I am developing the simple inverted index program frm the hadoop. >> My map >>> function has the output: >>> >>> and the reducer ha
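To make the multi-pass question above concrete: with the 0.16-era API the usual pattern is simply to submit two jobs in sequence from one driver, pointing the second job's input at the first job's output directory. The sketch below is only an illustration of that pattern, not code from this thread; the mapper/reducer class names and paths are hypothetical placeholders, and the path setters use the old JobConf methods (later releases move them onto FileInputFormat/FileOutputFormat).

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class TwoPassDriver {
  public static void main(String[] args) throws Exception {
    Path input = new Path(args[0]);
    Path indexOut = new Path(args[1]);   // output of pass 1, input of pass 2
    Path finalOut = new Path(args[2]);

    // Pass 1: build the raw inverted index (word -> doc ids).
    JobConf pass1 = new JobConf(TwoPassDriver.class);
    pass1.setJobName("inverted-index");
    pass1.setOutputKeyClass(Text.class);
    pass1.setOutputValueClass(Text.class);
    pass1.setMapperClass(IndexMapper.class);    // hypothetical mapper/reducer
    pass1.setReducerClass(IndexReducer.class);
    pass1.setInputPath(input);
    pass1.setOutputPath(indexOut);
    JobClient.runJob(pass1);                    // blocks until pass 1 is done

    // Pass 2: read pass 1's output and drop stop words / scrub entries.
    JobConf pass2 = new JobConf(TwoPassDriver.class);
    pass2.setJobName("scrub-index");
    pass2.setOutputKeyClass(Text.class);
    pass2.setOutputValueClass(Text.class);
    pass2.setMapperClass(ScrubMapper.class);    // hypothetical filtering mapper
    pass2.setReducerClass(IdentityReducer.class);
    pass2.setInputPath(indexOut);
    pass2.setOutputPath(finalOut);
    JobClient.runJob(pass2);
  }
}

JobClient.runJob blocks until the first job completes, so the second job always sees finished output; for stop-word removal the second mapper can load its stop list via the distributed cache mentioned earlier in the thread.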
Re: on number of input files and split size
So it seems best for my application if I can somehow consolidate smaller files into a couple of large files. All of my files reside on S3, and I am using the 'distcp' command to copy them to hdfs on EC2 before running a MR job. I was thinking it would be nice if I could modify distcp such that each EC2 image running 'distcp' on the EC2 cluster will concatenate input files into a single file, so that at the end of the copy process, we will have as many files as there are machines in the cluster. Any thoughts on how I should proceed with this, or on whether this is a good idea at all? Ted Dunning <[EMAIL PROTECTED]> wrote: The split will depend entirely on the input format that you use and the files that you have. In your case, you have lots of very small files so the limiting factor will almost certainly be the number of files. Thus, you will have 1000 splits (one per file). Your performance, btw, will likely be pretty poor with so many small files. Can you consolidate them? 100MB of data should probably be in no more than a few files if you want good performance. At that, most kinds of processing will be completely dominated by job startup time. If your jobs are I/O bound, they will be able to read 100MB of data in just a few seconds at most. Startup time for a hadoop job is typically 10 seconds or more. On 4/4/08 12:58 PM, "Prasan Ary" wrote: > I have a question on how input files are split before they are given out to > Map functions. > Say I have an input directory containing 1000 files whose total size is 100 > MB, and I have 10 machines in my cluster and I have configured 10 > mapred.map.tasks in hadoop-site.xml. > > 1. With this configuration, do we have a way to know what size each split > will be of? > 2. Does split size depend on how many files there are in the input > directory? What if I have only 10 files in input directory, but the total size > of all these files is still 100 MB? Will it affect split size? > > Thanks. > > > - > You rock. That's why Blockbuster's offering you one month of Blockbuster Total > Access, No Cost.
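On the consolidation question: a low-tech alternative to modifying distcp is a small client program that streams many small files into one large HDFS file before the job is submitted. A rough sketch, assuming plain line-oriented text files that each end with a newline (so simple byte concatenation preserves records) and only the basic FileSystem API; all paths are placeholders.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Concatenates the files named on the command line into one target file.
// Usage: ConcatFiles <target> <source1> <source2> ...
public class ConcatFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(new Path(args[0]));
    byte[] buf = new byte[64 * 1024];
    for (int i = 1; i < args.length; i++) {
      InputStream in = fs.open(new Path(args[i]));
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);   // append this file's bytes to the target
      }
      in.close();
    }
    out.close();
  }
}

Running one such concatenation per machine, as suggested above, would leave roughly one input file (and hence one split) per node.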
S3 Exception
Is there any additional configuration needed to run against S3 besides these instructions? http://wiki.apache.org/hadoop/AmazonS3 Following the instructions on that page, when I try to run "start- dfs.sh" I see the following exception in the logs: 2008-04-04 17:03:31,345 ERROR org.apache.hadoop.dfs.NameNode: java.lang.IllegalArgumentException: port out of range:-1 at java.net.InetSocketAddress.(InetSocketAddress.java:118) at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:125) at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:119) at org.apache.hadoop.dfs.NameNode.(NameNode.java:176) at org.apache.hadoop.dfs.NameNode.(NameNode.java:162) at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846) at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855) The node fails to start and exits after this exception is thrown. The bucket name I am using is of the format 'myuniqueprefix-hadoop'. I am using Hadoop 0.16.2 on OSX. I didn't see any Jira issues for this or any resolution in previous email threads. Is there a workaround for this, or have I missed some necessary step in configuration? Thanks, Craig B
Re: Hadoop: Multiple map reduce or some better way
Hi, I have not used lucene index ever before. I do not get how we build it with hadoop Map reduce. Basically what I was looking for like how to implement multilevel map/reduce for my mentioned problem. On Fri, Apr 4, 2008 at 7:23 PM, Ning Li <[EMAIL PROTECTED]> wrote: > You can build Lucene indexes using Hadoop Map/Reduce. See the index > contrib package in the trunk. Or is it still not something you are > looking for? > > Regards, > Ning > > On 4/4/08, Aayush Garg <[EMAIL PROTECTED]> wrote: > > No, currently my requirement is to solve this problem by apache hadoop. > I am > > trying to build up this type of inverted index and then measure > performance > > criteria with respect to others. > > > > Thanks, > > > > > > On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > > > > > > > > Are you implementing this for instruction or production? > > > > > > If production, why not use Lucene? > > > > > > > > > On 4/3/08 6:45 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote: > > > > > > > HI Amar , Theodore, Arun, > > > > > > > > Thanks for your reply. Actaully I am new to hadoop so cant figure > out > > > much. > > > > I have written following code for inverted index. This code maps > each > > > word > > > > from the document to its document id. > > > > ex: apple file1 file123 > > > > Main functions of the code are:- > > > > > > > > public class HadoopProgram extends Configured implements Tool { > > > > public static class MapClass extends MapReduceBase > > > > implements Mapper { > > > > > > > > private final static IntWritable one = new IntWritable(1); > > > > private Text word = new Text(); > > > > private Text doc = new Text(); > > > > private long numRecords=0; > > > > private String inputFile; > > > > > > > >public void configure(JobConf job){ > > > > System.out.println("Configure function is called"); > > > > inputFile = job.get("map.input.file"); > > > > System.out.println("In conf the input file is"+inputFile); > > > > } > > > > > > > > > > > > public void map(LongWritable key, Text value, > > > > OutputCollector output, > > > > Reporter reporter) throws IOException { > > > > String line = value.toString(); > > > > StringTokenizer itr = new StringTokenizer(line); > > > > doc.set(inputFile); > > > > while (itr.hasMoreTokens()) { > > > > word.set(itr.nextToken()); > > > > output.collect(word,doc); > > > > } > > > > if(++numRecords%4==0){ > > > >System.out.println("Finished processing of input > > > file"+inputFile); > > > > } > > > > } > > > > } > > > > > > > > /** > > > >* A reducer class that just emits the sum of the input values. 
> > > >*/ > > > > public static class Reduce extends MapReduceBase > > > > implements Reducer { > > > > > > > > // This works as K2, V2, K3, V3 > > > > public void reduce(Text key, Iterator values, > > > >OutputCollector output, > > > >Reporter reporter) throws IOException { > > > > int sum = 0; > > > > Text dummy = new Text(); > > > > ArrayList IDs = new ArrayList(); > > > > String str; > > > > > > > > while (values.hasNext()) { > > > > dummy = values.next(); > > > > str = dummy.toString(); > > > > IDs.add(str); > > > >} > > > >DocIDs dc = new DocIDs(); > > > >dc.setListdocs(IDs); > > > > output.collect(key,dc); > > > > } > > > > } > > > > > > > > public int run(String[] args) throws Exception { > > > > System.out.println("Run function is called"); > > > > JobConf conf = new JobConf(getConf(), WordCount.class); > > > > conf.setJobName("wordcount"); > > > > > > > > // the keys are words (strings) > > > > conf.setOutputKeyClass(Text.class); > > > > > > > > conf.setOutputValueClass(Text.class); > > > > > > > > > > > > conf.setMapperClass(MapClass.class); > > > > > > > > conf.setReducerClass(Reduce.class); > > > > } > > > > > > > > > > > > Now I am getting output array from the reducer as:- > > > > word \root\test\test123, \root\test12 > > > > > > > > In the next stage I want to stop 'stop words', scrub words etc. > and > > > like > > > > position of the word in the document. How would I apply multiple > maps or > > > > multilevel map reduce jobs programmatically? I guess I need to make > > > another > > > > class or add some functions in it? I am not able to figure it out. > > > > Any pointers for these type of problems? > > > > > > > > Thanks, > > > > Aayush > > > > > > > > > > > > On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <[EMAIL PROTECTED]> > > > wrote: > > > > > > > >> On Wed, 26 Mar 2008, Aayush Garg wrote: > > > >> > > > >>> HI, > > > >>> I am developing the simple inverted index program frm the hadoop. > My > > > map > > > >>> function has the output: > > > >>> > > > >>> and the reducer has
Re: more noob questions--how/when is data 'distributed' across a cluster?
I should add that systems like Pig and JAQL aim to satisfy your needs very nicely. They may or may not be ready for your needs, but they aren't terribly far away. Also, you should consider whether it is better for you to have a system that is considered "industry standard" (aka fully relational) or "somewhat experimental or avante garde". Different situations could force one answer or the other. On 4/4/08 11:48 AM, "Paul Danese" <[EMAIL PROTECTED]> wrote: > Hi, > > Currently I have a large (for me) amount of data stored in a relational > database (3 tables: each with 2 - 10 million related records. This is an > oversimplification, but for clarity it's close enough). > > There is a relatively simple Object-relational Mapping (ORM) to my > database: Specifically, my parent Object is called "Accident". > Accidents can have 1 or more Report objects (has many). > Reports can have 1 or more Outcome objects (has many). > > Each of these Objects maps to a specific table in my RDBMS w/ foreign keys > 'connecting' records between tables. > > I run searches against this database (using Lucene) and this works quite > well as long as I return only *subsets* of the *total* result-set at any one > time. > e.g. I may have 25,000 hits ("Accidents") that meet my threshold Lucene > score, but as long as I only query the database for 50 Accident "objects" at > any one time, the response time is great. > > The 'problem' is that I'd also like to use those 25,000 Accidents to > generate an electronic report as **quickly as possible** > (right now it takes about 30 minutes to collect all 25,000 hits from the > database, extract the relevant fields and construct the actual report). > Most of this 30 minutes is spent hitting the database and > processing/extracting the relevant data (generating the report is rather > fast once all the data are properly formatted). > > So...at my naive level, this seems like a decent job for hadoop. > ***QUESTION 1: Is this an accurate belief?*** > > i.e., I have a semi-large collection of key/value pairs (25,000 Accident IDs > would be the keys, and 25,000 Accident objects would be values) > > These object/value pairs are "mapped" on a cluster, extracting the relevant > data from each object. > The mapping then releases a new set of "key/value" pairs (in this case, > emitted keys are one of three categories (accident, report, outcome) and the > values are arrays of accident, report and outcome data that will go into the > report). > > These emitted key/value pairs are then "reduced" and resulting reduced > collections are used to build the report. > > ***QUESTION 2: If the answer to Q1 is "yes", how does one typically "move" > data from a rdbms to something like HDFS/HBase?*** > ***QUESTION 3: Am I right in thinking that my HBase data are going to be > denormalized relative to my RDBMS?*** > ***QUESTION 4: How are the data within an HBase database *actually* > distributed amongst nodes? i.e. is the distribution done automatically upon > creating the db (as long as the cluster already exists?) Or do you have to > issue some type of command that says "okay...here's the HBase db, distribute > it to nodes a - z"*** > ***QUESTION 5: Or is this whole problem something better addressed by some > type of high-performance rdbms cluster?*** > ***QUESTION 6: Is there a detailed (step by step) tutorial on how to use > HBase w/ Hadoop?*** > > > Anyway, apologies if this is the 1000th time this has been answered and > thank you for any insight!
Re: more noob questions--how/when is data 'distributed' across a cluster?
On 4/4/08 11:48 AM, "Paul Danese" <[EMAIL PROTECTED]> wrote: > [ ... Extract and report on 25,000 out of 10^6 records ...] > > So...at my naive level, this seems like a decent job for hadoop. > ***QUESTION 1: Is this an accurate belief?*** Sounds just right. On 10 loser machines, it is feasible to expect to be able to scan 100MB per second, so if your records (with associated data) total about 1GB, it should take about 10 seconds to pass over your data. With allowances for system reality you should be able to do something interesting in a minute or so. > [ ... Map does projection and object tagging, reduce does reporting ...] Your process outline looks fine. > > ***QUESTION 2: If the answer to Q1 is "yes", how does one typically "move" > data from a rdbms to something like HDFS/HBase?*** There are copy commands. Typically, you partition your data by date of insertion and dump new records into new files. Occasionally, you might merge older records to limit the number of total files. I don't see any need for HBase here, but since the two things that HBase does really well are insertion and table scan, it would be pretty natural to use. You should check with the HBase guys to see how long it will take to produce the output you want, but I would expect that you could get your 25000 rows faster than a raw scan using hadoop alone. > ***QUESTION 3: Am I right in thinking that my HBase data are going to be > denormalized relative to my RDBMS?*** Very likely. > ***QUESTION 5: Or is this whole problem something better addressed by some > type of high-performance rdbms cluster?*** This would be a very modest-sized Oracle data set, but I would guess that the costs would make Hadoop preferable. It isn't even all that large for MySQL. The existence of very nice reporting software for either of these could tip the balance the other way, however. > ***QUESTION 6: Is there a detailed (step by step) tutorial on how to use > HBase w/ Hadoop?*** I think you could start with hadoop alone.
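To make the "dump new records into new files" suggestion concrete, a plain JDBC export into an HDFS file is usually all that is needed. The sketch below is only an illustration; the connection string, table and column names are invented, and a real loader would add batching and error handling.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Dumps newly inserted rows into a tab-separated file on HDFS.
public class AccidentExport {
  public static void main(String[] args) throws Exception {
    Class.forName("com.mysql.jdbc.Driver");   // JDBC driver must be on the classpath
    Connection db = DriverManager.getConnection(
        "jdbc:mysql://dbhost/accidents", "user", "password");       // hypothetical
    Statement st = db.createStatement();
    ResultSet rs = st.executeQuery(
        "SELECT id, report_id, outcome, severity FROM accident "
        + "WHERE inserted_on = CURRENT_DATE");                      // hypothetical schema

    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path(args[0]));          // e.g. /accidents/2008-04-04.tsv
    while (rs.next()) {
      String line = rs.getString(1) + "\t" + rs.getString(2) + "\t"
          + rs.getString(3) + "\t" + rs.getString(4) + "\n";
      out.write(line.getBytes("UTF-8"));                            // one denormalized row per line
    }
    out.close();
    db.close();
  }
}

Partitioning the output path by insertion date, as described above, keeps each run small and makes the occasional merge of older files straightforward.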
Re: on number of input files and split size
The split will depend entirely on the input format that you use and the files that you have. In your case, you have lots of very small files so the limiting factor will almost certainly be the number of files. Thus, you will have 1000 splits (one per file). Your performance, btw, will likely be pretty poor with so many small files. Can you consolidate them? 100MB of data should probably be in no more than a few files if you want good performance. At that, most kinds of processing will be completely dominated by job startup time. If your jobs are I/O bound, they will be able to read 100MB of data in a just a few seconds at most. Startup time for a hadoop job is typically 10 seconds or more. On 4/4/08 12:58 PM, "Prasan Ary" <[EMAIL PROTECTED]> wrote: > I have a question on how input files are split before they are given out to > Map functions. > Say I have an input directory containing 1000 files whose total size is 100 > MB, and I have 10 machines in my cluster and I have configured 10 > mapred.map.tasks in hadoop-site.xml. > > 1. With this configuration, do we have a way to know what size each split > will be of? > 2. Does split size depend on how many files there are in the input > directory? What if I have only 10 files in input directory, but the total size > of all these files is still 100 MB? Will it affect split size? > > Thanks. > > > - > You rock. That's why Blockbuster's offering you one month of Blockbuster Total > Access, No Cost.
Re: distcp fails when copying from s3 to hdfs
Your distcp command looks correct. distcp may have created some log files (e.g. inside /_distcp_logs_5vzva5 from your previous email.) Could you check the logs, see whether there are error messages? If you could send me the distcp output and the logs, I may be able to find out the problem. (remember to remove the id:secret :) Nicholas - Original Message From: Siddhartha Reddy <[EMAIL PROTECTED]> To: core-user@hadoop.apache.org Sent: Friday, April 4, 2008 12:23:17 PM Subject: Re: distcp fails when copying from s3 to hdfs I am sorry, that was a mistype in my mail. The second command was (please note the / at the end): bin/hadoop fs -fs s3://id:[EMAIL PROTECTED] -ls / I guess you are right, Nicholas. The s3://id:[EMAIL PROTECTED]/file.txtindeed does not seem to be there. But the earlier distcp command to copy the file to S3 finished without errors. Once again, the command I am using to copy the file to S3 is: bin/hadoop distcp file.txt s3://id:[EMAIL PROTECTED]/file.txt Am I doing anything wrong here? Thanks, Siddhartha On Fri, Apr 4, 2008 at 11:38 PM, <[EMAIL PROTECTED]> wrote: > >To check that the file actually exists on S3, I tried the following > commands: > > > >bin/hadoop fs -fs s3://id:[EMAIL PROTECTED] -ls > >bin/hadoop fs -fs s3://id:[EMAIL PROTECTED] -ls > > > >The first returned nothing, while the second returned the following: > > > >Found 1 items > >/_distcp_logs_5vzva5 1969-12-31 19:00rwxrwxrwx > > > Are the first and the second commands the same? (why they return > different results?) It seems that distcp is right: > s3://id:[EMAIL PROTECTED]/file.txt indeed does not exist from the output > of your the second command. > > Nicholas > > -- http://sids.in "If you are not having fun, you are not doing it right."
on number of input files and split size
I have a question on how input files are split before they are given out to Map functions. Say I have an input directory containing 1000 files whose total size is 100 MB, and I have 10 machines in my cluster and I have configured 10 mapred.map.tasks in hadoop-site.xml. 1. With this configuration, do we have a way to know what size each split will be of? 2. Does split size depend on how many files there are in the input directory? What if I have only 10 files in input directory, but the total size of all these files is still 100 MB? Will it affect split size? Thanks. - You rock. That's why Blockbuster's offering you one month of Blockbuster Total Access, No Cost.
Re: Error msg: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
Thanks a lot for help. Only for register I would like to post which enviroment variables I set: export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun-1.5.0.13 export OS_NAME=linux export OS_ARCH=i386 export LIBHDFS_BUILD_DIR=/mnt/hd1/hadoop/hadoop-0.14.4/libhdfs export SHLIB_VERSION=1 export HADOOP_HOME=/mnt/hd1/hadoop/hadoop-0.14.4 export HADOOP_CONF_DIR=/mnt/hd1/hadoop/hadoop-0.14.4/conf export HADOOP_LOG_DIR=/mnt/hd1/hadoop/hadoop-0.14.4/logs export LD_LIBRARY_PATH=/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/jre/lib/i386/server:/mnt/hd1/hadoop/hadoop-0.14.4/libhdfs export CLASSPATH=/mnt/hd1/hadoop/hadoop-0.14.4/hadoop-0.14.4-core.jar:/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/jre/lib/i386:/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/jre/lib/i386/server:/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/jre/lib/i386/client:/usr/share/java:/usr/share/java/commons-logging-api.jar:/usr/share/java/:/usr/share/java/commons-logging.jar:/usr/share/java/log4j-1.2.jar export HADOOP_CLASSPATH=/mnt/hd1/hadoop/hadoop-0.14.4/hadoop-0.14.4-core.jar:/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/jre/lib/i386:/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/jre/lib/i386/server:/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/jre/lib/i386/client:/usr/share/java:/usr/share/java/commons-logging-api.jar:/usr/share/java/:/usr/share/java/commons-logging.jar:/usr/share/java/log4j-1.2.jar See you, Anisio On Thu, Apr 3, 2008 at 8:51 AM, Peeyush Bishnoi <[EMAIL PROTECTED]> wrote: > Hello , > > As this problem is related to CLASSPATh of hadoop , so just set the > HADOOP_CLASSPATH or CLASSPATH with hadoop core jar > > --- > Peeyush > > > On Wed, 2008-04-02 at 13:51 -0300, Anisio Mendes Lacerda wrote: > > > Hi, > > > > me and my coleagues are implementing a small search engine in my > University > > Laboratory, > > and we would like to use Hadoop as file system. > > > > For now we are having troubles in running the following simple code > example: > > > > > > #include "hdfs.h" > > int main(int argc, char **argv) { > > hdfsFS fs = hdfsConnect("apolo.latin.dcc.ufmg.br", 51070); > > if(!fs) { > > fprintf(stderr, "Oops! Failed to connect to hdfs!\n"); > > exit(-1); > > } > > int result = hdfsDisconnect(fs); > > if(!result) { > > fprintf(stderr, "Oops! Failed to connect to hdfs!\n"); > > exit(-1); > > } > > } > > > > > > The error msg is: > > > > Exception in thread "main" java.lang.NoClassDefFoundError: > > org/apache/hadoop/conf/Configuration > > > > > > > > We configured the following enviroment variables: > > > > export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun-1.5.0.13 > > export OS_NAME=linux > > export OS_ARCH=i386 > > export LIBHDFS_BUILD_DIR=/mnt/hd1/hadoop/hadoop-0.14.4/libhdfs > > export SHLIB_VERSION=1 > > > > export HADOOP_HOME=/mnt/hd1/hadoop/hadoop-0.14.4 > > export HADOOP_CONF_DIR=/mnt/hd1/hadoop/hadoop-0.14.4/conf > > export HADOOP_LOG_DIR=/mnt/hd1/hadoop/hadoop-0.14.4/logs > > > > The following commands were used to compile the codes: > > > > In directory: hadoop-0.14.4/src/c++/libhdfs > > > > 1 - make all > > 2 - gcc -I/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/include/ -c > my_hdfs_test.c > > 3 - gcc my_hdfs_test.o -I/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/include/ > > -L/mnt/hd1/hadoop/hadoop-0.14.4/libhdfs -lhdfs > > -L/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/jre/lib/i386/server -ljvm -o > > my_hdfs_test > > > > Obs: > > > > The hadoop file system seems to be ok once we can run commands like > this: > > > > [EMAIL PROTECTED]:/mnt/hd1/hadoop/hadoop-0.14.4$ bin/hadoop dfs -ls > > Found 0 items > > > > > > > > > > > > > -- []s Anisio Mendes Lacerda
Re: distcp fails when copying from s3 to hdfs
I am sorry, that was a typo in my mail. The second command was (please note the / at the end): bin/hadoop fs -fs s3://id:[EMAIL PROTECTED] -ls / I guess you are right, Nicholas. The s3://id:[EMAIL PROTECTED]/file.txt indeed does not seem to be there. But the earlier distcp command to copy the file to S3 finished without errors. Once again, the command I am using to copy the file to S3 is: bin/hadoop distcp file.txt s3://id:[EMAIL PROTECTED]/file.txt Am I doing anything wrong here? Thanks, Siddhartha On Fri, Apr 4, 2008 at 11:38 PM, <[EMAIL PROTECTED]> wrote: > >To check that the file actually exists on S3, I tried the following > commands: > > > >bin/hadoop fs -fs s3://id:[EMAIL PROTECTED] -ls > >bin/hadoop fs -fs s3://id:[EMAIL PROTECTED] -ls > > > >The first returned nothing, while the second returned the following: > > > >Found 1 items > >/_distcp_logs_5vzva5 1969-12-31 19:00 rwxrwxrwx > > > Are the first and the second commands the same? (Why do they return > different results?) It seems that distcp is right: > s3://id:[EMAIL PROTECTED]/file.txt indeed does not exist from the output > of your second command. > > Nicholas > > -- http://sids.in "If you are not having fun, you are not doing it right."
more noob questions--how/when is data 'distributed' across a cluster?
Hi, Currently I have a large (for me) amount of data stored in a relational database (3 tables: each with 2 - 10 million related records. This is an oversimplification, but for clarity it's close enough). There is a relatively simple Object-relational Mapping (ORM) to my database: Specifically, my parent Object is called "Accident". Accidents can have 1 or more Report objects (has many). Reports can have 1 or more Outcome objects (has many). Each of these Objects maps to a specific table in my RDBMS w/ foreign keys 'connecting' records between tables. I run searches against this database (using Lucene) and this works quite well as long as I return only *subsets* of the *total* result-set at any one time. e.g. I may have 25,000 hits ("Accidents") that meet my threshold Lucene score, but as long as I only query the database for 50 Accident "objects" at any one time, the response time is great. The 'problem' is that I'd also like to use those 25,000 Accidents to generate an electronic report as **quickly as possible** (right now it takes about 30 minutes to collect all 25,000 hits from the database, extract the relevant fields and construct the actual report). Most of this 30 minutes is spent hitting the database and processing/extracting the relevant data (generating the report is rather fast once all the data are properly formatted). So...at my naive level, this seems like a decent job for hadoop. ***QUESTION 1: Is this an accurate belief?*** i.e., I have a semi-large collection of key/value pairs (25,000 Accident IDs would be the keys, and 25,000 Accident objects would be values) These object/value pairs are "mapped" on a cluster, extracting the relevant data from each object. The mapping then releases a new set of "key/value" pairs (in this case, emitted keys are one of three categories (accident, report, outcome) and the values are arrays of accident, report and outcome data that will go into the report). These emitted key/value pairs are then "reduced" and resulting reduced collections are used to build the report. ***QUESTION 2: If the answer to Q1 is "yes", how does one typically "move" data from a rdbms to something like HDFS/HBase?*** ***QUESTION 3: Am I right in thinking that my HBase data are going to be denormalized relative to my RDBMS?*** ***QUESTION 4: How are the data within an HBase database *actually* distributed amongst nodes? i.e. is the distribution done automatically upon creating the db (as long as the cluster already exists?) Or do you have to issue some type of command that says "okay...here's the HBase db, distribute it to nodes a - z"*** ***QUESTION 5: Or is this whole problem something better addressed by some type of high-performance rdbms cluster?*** ***QUESTION 6: Is there a detailed (step by step) tutorial on how to use HBase w/ Hadoop?*** Anyway, apologies if this is the 1000th time this has been answered and thank you for any insight!
Re: distcp fails when copying from s3 to hdfs
>To check that the file actually exists on S3, I tried the following commands: > >bin/hadoop fs -fs s3://id:[EMAIL PROTECTED] -ls >bin/hadoop fs -fs s3://id:[EMAIL PROTECTED] -ls > >The first returned nothing, while the second returned the following: > >Found 1 items >/_distcp_logs_5vzva5 1969-12-31 19:00 rwxrwxrwx Are the first and the second commands the same? (Why do they return different results?) It seems that distcp is right: s3://id:[EMAIL PROTECTED]/file.txt indeed does not exist from the output of your second command. Nicholas
Re: Streaming + custom input format
On 4/4/08 10:18 AM, "Francesco Tamberi" <[EMAIL PROTECTED]> wrote: > Thanks for your fast reply! > > Ted Dunning wrote: >> Take a look at the way that the text input format moves to the next line >> after a split point. >> >> > I'm not sure I understand... is my way correct or are you suggesting > another one? I am not sure if I was suggesting something different, but I think it was. It sounded like you were going to find good split points in the getSplits method. The TextInputFormat doesn't try to be so clever, since that would involve (serialized) reading of parts of the file. Instead, it picks the break points *WITHOUT* reference to the contents of the files. Then, when the mapper lights up, the input format jumps to the assigned point, reads the remainder of the line at that point and then starts sending full lines to the mapper, continuing until it hits the end of the file OR passes the beginning of the next split. This means that it may read additional data after the assigned end point, but that extra data is guaranteed to be ignored by the input format in charge of reading the next split. This is a very clever and simple solution to the problem that depends only on being able to find a boundary between records. If you can do that, then you are golden.
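For anyone writing their own input format along these lines, the trick reduces to a few lines: seek to the assigned start offset, discard the partial record you land in (unless you start at offset zero), then read whole records until the record you just started lies at or beyond the split end. A standalone sketch over a local file, in plain Java with no Hadoop classes, purely to illustrate the idea.

import java.io.RandomAccessFile;

// Prints the records belonging to one "split" of a newline-delimited file.
// Usage: SplitReader <file> <startOffset> <endOffset>
public class SplitReader {
  public static void main(String[] args) throws Exception {
    RandomAccessFile in = new RandomAccessFile(args[0], "r");
    long start = Long.parseLong(args[1]);
    long end = Long.parseLong(args[2]);

    if (start == 0) {
      in.seek(0);
    } else {
      in.seek(start - 1);   // back up one byte so a record starting exactly at 'start' is not lost
      in.readLine();        // discard the tail of the record owned by the previous split
    }
    long pos = in.getFilePointer();
    String line;
    // Read every record that *starts* before 'end'; the last one may run past 'end',
    // which is exactly the over-read described above. The reader of the next split
    // skips that partial record, so nothing is read twice or lost.
    while (pos < end && (line = in.readLine()) != null) {
      System.out.println(line);
      pos = in.getFilePointer();
    }
    in.close();
  }
}

The same logic works for any record format in which scanning forward can find the next record boundary, which is the only requirement mentioned above.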
Re: Hadoop: Multiple map reduce or some better way
You can build Lucene indexes using Hadoop Map/Reduce. See the index contrib package in the trunk. Or is it still not something you are looking for? Regards, Ning On 4/4/08, Aayush Garg <[EMAIL PROTECTED]> wrote: > No, currently my requirement is to solve this problem by apache hadoop. I am > trying to build up this type of inverted index and then measure performance > criteria with respect to others. > > Thanks, > > > On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > > > > > Are you implementing this for instruction or production? > > > > If production, why not use Lucene? > > > > > > On 4/3/08 6:45 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote: > > > > > HI Amar , Theodore, Arun, > > > > > > Thanks for your reply. Actaully I am new to hadoop so cant figure out > > much. > > > I have written following code for inverted index. This code maps each > > word > > > from the document to its document id. > > > ex: apple file1 file123 > > > Main functions of the code are:- > > > > > > public class HadoopProgram extends Configured implements Tool { > > > public static class MapClass extends MapReduceBase > > > implements Mapper { > > > > > > private final static IntWritable one = new IntWritable(1); > > > private Text word = new Text(); > > > private Text doc = new Text(); > > > private long numRecords=0; > > > private String inputFile; > > > > > >public void configure(JobConf job){ > > > System.out.println("Configure function is called"); > > > inputFile = job.get("map.input.file"); > > > System.out.println("In conf the input file is"+inputFile); > > > } > > > > > > > > > public void map(LongWritable key, Text value, > > > OutputCollector output, > > > Reporter reporter) throws IOException { > > > String line = value.toString(); > > > StringTokenizer itr = new StringTokenizer(line); > > > doc.set(inputFile); > > > while (itr.hasMoreTokens()) { > > > word.set(itr.nextToken()); > > > output.collect(word,doc); > > > } > > > if(++numRecords%4==0){ > > >System.out.println("Finished processing of input > > file"+inputFile); > > > } > > > } > > > } > > > > > > /** > > >* A reducer class that just emits the sum of the input values. > > >*/ > > > public static class Reduce extends MapReduceBase > > > implements Reducer { > > > > > > // This works as K2, V2, K3, V3 > > > public void reduce(Text key, Iterator values, > > >OutputCollector output, > > >Reporter reporter) throws IOException { > > > int sum = 0; > > > Text dummy = new Text(); > > > ArrayList IDs = new ArrayList(); > > > String str; > > > > > > while (values.hasNext()) { > > > dummy = values.next(); > > > str = dummy.toString(); > > > IDs.add(str); > > >} > > >DocIDs dc = new DocIDs(); > > >dc.setListdocs(IDs); > > > output.collect(key,dc); > > > } > > > } > > > > > > public int run(String[] args) throws Exception { > > > System.out.println("Run function is called"); > > > JobConf conf = new JobConf(getConf(), WordCount.class); > > > conf.setJobName("wordcount"); > > > > > > // the keys are words (strings) > > > conf.setOutputKeyClass(Text.class); > > > > > > conf.setOutputValueClass(Text.class); > > > > > > > > > conf.setMapperClass(MapClass.class); > > > > > > conf.setReducerClass(Reduce.class); > > > } > > > > > > > > > Now I am getting output array from the reducer as:- > > > word \root\test\test123, \root\test12 > > > > > > In the next stage I want to stop 'stop words', scrub words etc. and > > like > > > position of the word in the document. 
How would I apply multiple maps or > > > multilevel map reduce jobs programmatically? I guess I need to make > > another > > > class or add some functions in it? I am not able to figure it out. > > > Any pointers for these type of problems? > > > > > > Thanks, > > > Aayush > > > > > > > > > On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <[EMAIL PROTECTED]> > > wrote: > > > > > >> On Wed, 26 Mar 2008, Aayush Garg wrote: > > >> > > >>> HI, > > >>> I am developing the simple inverted index program frm the hadoop. My > > map > > >>> function has the output: > > >>> > > >>> and the reducer has: > > >>> > > >>> > > >>> Now I want to use one more mapreduce to remove stop and scrub words > > from > > >> Use distributed cache as Arun mentioned. > > >>> this output. Also in the next stage I would like to have short summay > > >> Whether to use a separate MR job depends on what exactly you mean by > > >> summary. If its like a window around the current word then you can > > >> possibly do it in one go. > > >> Amar > > >>> associated with every word. How should I design my program from this > > >> stage? > > >>> I mean how would I apply multiple map
Re: Streaming + custom input format
I already tried that... nothing happened... Thank you, -- Francesco Ted Dunning wrote: I saw that, but I don't know if it will put a jar into the classpath at the other end. On 4/4/08 9:56 AM, "Yuri Pradkin" <[EMAIL PROTECTED]> wrote: There is a -file option to streaming that -file File/dir to be shipped in the Job jar file On Friday 04 April 2008 09:24:59 am Ted Dunning wrote: At one point, it was necessary to unpack the streaming.jar file and put your own classes and jars into that. Last time I looked at the code, however, there was support for that happening magically, but in the 30 seconds I have allotted to help you (sorry bout that), I can't see that there is a command line option to trigger that, unless it is the one for including a file in the jar file.
Re: Streaming + custom input format
Thanks for your fast reply! Ted Dunning wrote: Take a look at the way that the text input format moves to the next line after a split point. I'm not sure I understand... is my way correct or are you suggesting another one? There are a couple of possible problems with your input format not found problem. First, is your input in a package? If so, you need to provide a complete name for the class. I forgot to explain, but I provided the complete class name (i.e. org.myName.ClassName). Secondly, you have to give streaming information about how to package up your input format class for transfer to the cluster. Having access to the class on your initial invoking machine is not sufficient. At one point, it was necessary to unpack the streaming.jar file and put your own classes and jars into that. Last time I looked at the code, however, there was support for that happening magically, but in the 30 seconds I have allotted to help you (sorry bout that), I can't see that there is a command line option to trigger that, unless it is the one for including a file in the jar file. I'll try to include my jar/class in streaming.jar... it's not so clean, but it would be great if it works! I'll keep you informed ;) Thank you again On 4/4/08 3:00 AM, "Francesco Tamberi" <[EMAIL PROTECTED]> wrote: Hi All, I have a streaming tool chain written in c++/python that performs some operations on really big text files (gigabytes order); the chain reads files and writes its result to standard output. The chain needs to read well structured files and so I need to control how hadoop splits files: it should splits a file only at suitable places. What's the best way to do that? I'm trying defining a custom input format in that way but I'm not sure it's ok: public class MyInputFormat extends FileInputFormat { ... public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException { ... } } That said, I tried to run that (on hadoop 0.15.3, 0.16.0, 0.16.1) with: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.16.1-streaming.jar -file ./allTools.sh -mapper "allTools.sh" -jobconf mapred.reduce.tasks=0 -file pathToMyClass.class -inputformat MyClass -input test.txt -output test-output But it raises an exception "-inputformat : class not found : MyClass" I tried passing a jar instead of class file, putting them in HADOOP_CLASSPATH, putting in system' CLASSPATH but always the same result.. Thank you for your patience! -- Francesco
Re: Hadoop: Multiple map reduce or some better way
No, currently my requirement is to solve this problem by apache hadoop. I am trying to build up this type of inverted index and then measure performance criteria with respect to others. Thanks, On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > > Are you implementing this for instruction or production? > > If production, why not use Lucene? > > > On 4/3/08 6:45 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote: > > > HI Amar , Theodore, Arun, > > > > Thanks for your reply. Actaully I am new to hadoop so cant figure out > much. > > I have written following code for inverted index. This code maps each > word > > from the document to its document id. > > ex: apple file1 file123 > > Main functions of the code are:- > > > > public class HadoopProgram extends Configured implements Tool { > > public static class MapClass extends MapReduceBase > > implements Mapper { > > > > private final static IntWritable one = new IntWritable(1); > > private Text word = new Text(); > > private Text doc = new Text(); > > private long numRecords=0; > > private String inputFile; > > > >public void configure(JobConf job){ > > System.out.println("Configure function is called"); > > inputFile = job.get("map.input.file"); > > System.out.println("In conf the input file is"+inputFile); > > } > > > > > > public void map(LongWritable key, Text value, > > OutputCollector output, > > Reporter reporter) throws IOException { > > String line = value.toString(); > > StringTokenizer itr = new StringTokenizer(line); > > doc.set(inputFile); > > while (itr.hasMoreTokens()) { > > word.set(itr.nextToken()); > > output.collect(word,doc); > > } > > if(++numRecords%4==0){ > >System.out.println("Finished processing of input > file"+inputFile); > > } > > } > > } > > > > /** > >* A reducer class that just emits the sum of the input values. > >*/ > > public static class Reduce extends MapReduceBase > > implements Reducer { > > > > // This works as K2, V2, K3, V3 > > public void reduce(Text key, Iterator values, > >OutputCollector output, > >Reporter reporter) throws IOException { > > int sum = 0; > > Text dummy = new Text(); > > ArrayList IDs = new ArrayList(); > > String str; > > > > while (values.hasNext()) { > > dummy = values.next(); > > str = dummy.toString(); > > IDs.add(str); > >} > >DocIDs dc = new DocIDs(); > >dc.setListdocs(IDs); > > output.collect(key,dc); > > } > > } > > > > public int run(String[] args) throws Exception { > > System.out.println("Run function is called"); > > JobConf conf = new JobConf(getConf(), WordCount.class); > > conf.setJobName("wordcount"); > > > > // the keys are words (strings) > > conf.setOutputKeyClass(Text.class); > > > > conf.setOutputValueClass(Text.class); > > > > > > conf.setMapperClass(MapClass.class); > > > > conf.setReducerClass(Reduce.class); > > } > > > > > > Now I am getting output array from the reducer as:- > > word \root\test\test123, \root\test12 > > > > In the next stage I want to stop 'stop words', scrub words etc. and > like > > position of the word in the document. How would I apply multiple maps or > > multilevel map reduce jobs programmatically? I guess I need to make > another > > class or add some functions in it? I am not able to figure it out. > > Any pointers for these type of problems? > > > > Thanks, > > Aayush > > > > > > On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <[EMAIL PROTECTED]> > wrote: > > > >> On Wed, 26 Mar 2008, Aayush Garg wrote: > >> > >>> HI, > >>> I am developing the simple inverted index program frm the hadoop. 
My > map > >>> function has the output: > >>> > >>> and the reducer has: > >>> > >>> > >>> Now I want to use one more mapreduce to remove stop and scrub words > from > >> Use distributed cache as Arun mentioned. > >>> this output. Also in the next stage I would like to have short summay > >> Whether to use a separate MR job depends on what exactly you mean by > >> summary. If its like a window around the current word then you can > >> possibly do it in one go. > >> Amar > >>> associated with every word. How should I design my program from this > >> stage? > >>> I mean how would I apply multiple mapreduce to this? What would be the > >>> better way to perform this? > >>> > >>> Thanks, > >>> > >>> Regards, > >>> - > >>> > >>> > >> > > -- Aayush Garg, Phone: +41 76 482 240
Re: Streaming + custom input format
I saw that, but I don't know if it will put a jar into the classpath at the other end. On 4/4/08 9:56 AM, "Yuri Pradkin" <[EMAIL PROTECTED]> wrote: > There is a -file option to streaming that > -file File/dir to be shipped in the Job jar file > > On Friday 04 April 2008 09:24:59 am Ted Dunning wrote: >> At one point, it >> was necessary to unpack the streaming.jar file and put your own classes and >> jars into that. Last time I looked at the code, however, there was support >> for that happening magically, but in the 30 seconds I have allotted to help >> you (sorry bout that), I can't see that there is a command line option to >> trigger that, unless it is the one for including a file in the jar file. > >
Re: Streaming + custom input format
There is a -file option to streaming that -file File/dir to be shipped in the Job jar file On Friday 04 April 2008 09:24:59 am Ted Dunning wrote: > At one point, it > was necessary to unpack the streaming.jar file and put your own classes and > jars into that. Last time I looked at the code, however, there was support > for that happening magically, but in the 30 seconds I have allotted to help > you (sorry bout that), I can't see that there is a command line option to > trigger that, unless it is the one for including a file in the jar file.
Re: Is it possible in Hadoop to overwrite or update a file?
My suggestion actually is similar to what bigtable and hbase do. That is to keep some recent updates in memory, burping them to disk at relatively frequent intervals. Then when a number of burps are available, they can be merged to a larger burp. This pyramid can be extended as needed. Searches then would have to probe the files at each level to find the latest version of the record. File append or a highly reliable in-memory storage for the current partial burp (something like zookeeper or replicated memcache?) is required for this approach to not lose recent data on machine failure, but it has the potential for high write rates while maintaining reasonable read rates. Of course, truly random reads will kill you, but reads biased toward recently updated records should have awesome performance. On 4/4/08 5:04 AM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote: > Ted Dunning wrote: > >> This factor of 1500 in speed seems pretty significant and is the motivation >> for not supporting random read/write. >> >> This doesn't mean that random access update should never be done, but it >> does mean that scaling a design based around random access will be more >> difficult than scaling a design based on sequential read and write. >> >> On 4/3/08 12:07 PM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote: >> >>> In general, if updates are relatively frequent and small compared to the >>> size of data then this could be useful. >> >> > > Hehe ... yes, good calculations :) What I had in mind though when saying > "relatively frequent" was rather a situation when updates are usually > small and come at unpredictable intervals (e.g. picked from a queue > listener) and then need to set flags on a few records. Running > sequential update in face of such minor changes doesn't usually pay off, > and queueing the changes so that it starts to pay off is sometimes not > possible (takes too long to fill the batch). >
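A toy sketch of the "burp" idea in plain Java, just to make the read and write paths explicit; the names and the flush threshold are made up, and real systems (HBase, Bigtable) add a write-ahead log, background merging of smaller files into larger ones, and indexes or bloom filters so reads do not scan whole files.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory writes that are periodically "burped" to sorted files on disk.
// Reads probe memory first, then the burp files from newest to oldest.
public class BurpStore {
  private final TreeMap<String, String> mem = new TreeMap<String, String>();
  private final List<File> burps = new ArrayList<File>();   // oldest first
  private final File dir;
  private final int memLimit;

  public BurpStore(File dir, int memLimit) {
    this.dir = dir;
    this.memLimit = memLimit;
  }

  public void put(String key, String value) throws IOException {
    mem.put(key, value);
    if (mem.size() >= memLimit) {
      flush();                                 // the "burp"
    }
  }

  public String get(String key) throws IOException {
    if (mem.containsKey(key)) {
      return mem.get(key);                     // newest data wins
    }
    for (int i = burps.size() - 1; i >= 0; i--) {
      String v = scan(burps.get(i), key);      // probe newest burp first
      if (v != null) {
        return v;
      }
    }
    return null;
  }

  private void flush() throws IOException {
    File f = new File(dir, "burp-" + burps.size());
    PrintWriter out = new PrintWriter(new FileWriter(f));
    for (Map.Entry<String, String> e : mem.entrySet()) {   // TreeMap => written in key order
      out.println(e.getKey() + "\t" + e.getValue());
    }
    out.close();
    burps.add(f);
    mem.clear();
  }

  // Linear scan for clarity; a sorted file really allows binary search or a block index.
  private String scan(File f, String key) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(f));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        int tab = line.indexOf('\t');
        if (line.substring(0, tab).equals(key)) {
          return line.substring(tab + 1);
        }
      }
      return null;
    } finally {
      in.close();
    }
  }
}

The merge step described above (combining several burps into a larger one) is omitted here; it is what keeps the number of files, and therefore the read cost, bounded.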
Re: Streaming + custom input format
Take a looks at the way that the text input format moves to the next line after a split point. There are a couple of possible problems with your input format not found problem. First, is your input in a package? If so, you need to provide a complete name for the class. Secondly, you have to give streaming information about how to package up your input format class for transfer to the cluster. Having access to the class on your initial invoking machine is not sufficient. At one point, it was necessary to unpack the streaming.jar file and put your own classes and jars into that. Last time I looked at the code, however, there was support for that happening magically, but in the 30 seconds I have allotted to help you (sorry bout that), I can't see that there is a command line option to trigger that, unless it is the one for including a file in the jar file. On 4/4/08 3:00 AM, "Francesco Tamberi" <[EMAIL PROTECTED]> wrote: > Hi All, > I have a streaming tool chain written in c++/python that performs some > operations on really big text files (gigabytes order); the chain reads files > and writes its result to standard output. > The chain needs to read well structured files and so I need to control how > hadoop splits files: it should splits a file only at suitable places. > What's the best way to do that? > I'm trying defining a custom input format in that way but I'm not sure it's > ok: > > public class MyInputFormat extends FileInputFormat { > ... > > public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException { > ... > } > } > > That said, I tried to run that (on hadoop 0.15.3, 0.16.0, 0.16.1) with: > > $HADOOP_HOME/bin/hadoop jar > $HADOOP_HOME/contrib/streaming/hadoop-0.16.1-streaming.jar -file ./allTools.sh > -mapper "allTools.sh" -jobconf mapred.reduce.tasks=0 -file pathToMyClass.class > -inputformat MyClass -input test.txt -output test-output > > But it raises an exception "-inputformat : class not found : MyClass" > I tried passing a jar instead of class file, putting them in HADOOP_CLASSPATH, > putting in system' CLASSPATH but always the same result.. > > Thank you for your patience! > -- Francesco > > >
Re: Newbie and basic questions
Hi Alberto,

Here's my take as someone from the traditional RDBMS world who has been
experimenting with Hadoop for a month, so don't take my comments to be
definitive.

On Fri, Apr 4, 2008 at 7:57 AM, Alberto Mesas <[EMAIL PROTECTED]> wrote:

> We have been reading some docs, and playing with the basic samples that
> come with Hadoop 0.16.2. So let's see if we have understood everything :)
>
> We plan to use HCore for processing our logs, but would it be possible to
> use it for a case like this one?
>
> A MySQL table with a few thousand new rows every hour.
> Every unique row must be sent by HTTP/GET to remote servers.
> Update the row with the HTTP/GET result.
> If the HTTP/GET fails, retry that row again later.
>
> We can't send the same unique row twice, so we must be sure that each is
> sent once and only once, well, at least until we get a valid response.

This really sounds like an OLTP-type system; your use of the word UPDATE
pegs it as a pretty good candidate for a traditional row-store RDBMS.
Hadoop is really directed at read-intensive stuff. A good example would be
analyzing log files and generating traffic statistics based on their
content in large batch jobs. There is no question that you could build a
system to solve your problem with Hadoop, but it would take more doing than
doing it with Oracle/Postgres/whatever.

You might have a look at HBase though: http://wiki.apache.org/hadoop/Hbase

I think of Hadoop as a very general tool that can be used to spread large
tasks across a cluster of machines. I think this fact is somewhat obscured
by the comparisons often made between hadoop/mapreduce and relational
databases.

> Right now we are doing it with a simple Java app that selects N rows from
> the table and uses N threads to do N HTTP/GETs.

Honestly, if you have a simple app doing this for you, I'd say stick with
that.

> If it's possible to implement this in Hadoop, we get easy scale-out and
> high availability.
>
> Scaling up is becoming a bit of a problem, and we have no real HA.
>
> I have doubts like how to "select unique rows" and give them to Hadoop,
> etc...
>
> Any advice is welcome :)

Hope that helps,

--
Travis Brady
www.mochiads.com
Re: If I wanna read a config file before map task, which class I should choose?
Just write a parser and put it into the configure method. On 4/3/08 8:31 PM, "Jeremy Chow" <[EMAIL PROTECTED]> wrote: > thanks, the configure file format looks like below, > > @tag_name0 name0 {value00, value01, value02} > @tag_name1 name1 {value10, value11, value12} > > and reading it from HDFS. Then how can I parse them ?
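A rough sketch of what that can look like with the old mapred API follows; the property name my.config.path, the base-class name, and the regular expression are all invented for illustration. The driver puts the HDFS path of the config file into the JobConf, and configure() opens that file and parses each @tag name {v1, v2, v3} line before any map() call runs.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

// Base class a mapper can extend: reads a small config file from HDFS once
// per task, before any map() calls. Expected line format:
//   @tag_name0 name0 {value00, value01, value02}
public abstract class ConfigParsingBase extends MapReduceBase {

    // tag -> values; filled in configure(), available to map()
    protected final Map<String, String[]> tagValues = new HashMap<String, String[]>();

    private static final Pattern LINE =
        Pattern.compile("@(\\S+)\\s+(\\S+)\\s*\\{([^}]*)\\}");

    public void configure(JobConf job) {
        // "my.config.path" is a made-up property the driver would set, e.g.
        // conf.set("my.config.path", "/user/me/myjob.conf");
        String configPath = job.get("my.config.path");
        if (configPath == null) {
            return; // nothing to load
        }
        try {
            FileSystem fs = FileSystem.get(job);
            BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path(configPath))));
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = LINE.matcher(line.trim());
                if (m.matches()) {
                    // group(1) = tag, group(2) = name, group(3) = comma list
                    tagValues.put(m.group(1), m.group(3).split("\\s*,\\s*"));
                }
            }
            in.close();
        } catch (IOException e) {
            throw new RuntimeException("Could not read config " + configPath, e);
        }
    }
}

The parsed values end up in an in-memory map that map() can consult; for larger lookup data, the DistributedCache route mentioned elsewhere in this digest is the usual alternative.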
Re: Hadoop: Multiple map reduce or some better way
Are you implementing this for instruction or production? If production, why not use Lucene? On 4/3/08 6:45 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote: > HI Amar , Theodore, Arun, > > Thanks for your reply. Actaully I am new to hadoop so cant figure out much. > I have written following code for inverted index. This code maps each word > from the document to its document id. > ex: apple file1 file123 > Main functions of the code are:- > > public class HadoopProgram extends Configured implements Tool { > public static class MapClass extends MapReduceBase > implements Mapper { > > private final static IntWritable one = new IntWritable(1); > private Text word = new Text(); > private Text doc = new Text(); > private long numRecords=0; > private String inputFile; > >public void configure(JobConf job){ > System.out.println("Configure function is called"); > inputFile = job.get("map.input.file"); > System.out.println("In conf the input file is"+inputFile); > } > > > public void map(LongWritable key, Text value, > OutputCollector output, > Reporter reporter) throws IOException { > String line = value.toString(); > StringTokenizer itr = new StringTokenizer(line); > doc.set(inputFile); > while (itr.hasMoreTokens()) { > word.set(itr.nextToken()); > output.collect(word,doc); > } > if(++numRecords%4==0){ >System.out.println("Finished processing of input file"+inputFile); > } > } > } > > /** >* A reducer class that just emits the sum of the input values. >*/ > public static class Reduce extends MapReduceBase > implements Reducer { > > // This works as K2, V2, K3, V3 > public void reduce(Text key, Iterator values, >OutputCollector output, >Reporter reporter) throws IOException { > int sum = 0; > Text dummy = new Text(); > ArrayList IDs = new ArrayList(); > String str; > > while (values.hasNext()) { > dummy = values.next(); > str = dummy.toString(); > IDs.add(str); >} >DocIDs dc = new DocIDs(); >dc.setListdocs(IDs); > output.collect(key,dc); > } > } > > public int run(String[] args) throws Exception { > System.out.println("Run function is called"); > JobConf conf = new JobConf(getConf(), WordCount.class); > conf.setJobName("wordcount"); > > // the keys are words (strings) > conf.setOutputKeyClass(Text.class); > > conf.setOutputValueClass(Text.class); > > > conf.setMapperClass(MapClass.class); > > conf.setReducerClass(Reduce.class); > } > > > Now I am getting output array from the reducer as:- > word \root\test\test123, \root\test12 > > In the next stage I want to stop 'stop words', scrub words etc. and like > position of the word in the document. How would I apply multiple maps or > multilevel map reduce jobs programmatically? I guess I need to make another > class or add some functions in it? I am not able to figure it out. > Any pointers for these type of problems? > > Thanks, > Aayush > > > On Thu, Mar 27, 2008 at 6:14 AM, Amar Kamat <[EMAIL PROTECTED]> wrote: > >> On Wed, 26 Mar 2008, Aayush Garg wrote: >> >>> HI, >>> I am developing the simple inverted index program frm the hadoop. My map >>> function has the output: >>> >>> and the reducer has: >>> >>> >>> Now I want to use one more mapreduce to remove stop and scrub words from >> Use distributed cache as Arun mentioned. >>> this output. Also in the next stage I would like to have short summay >> Whether to use a separate MR job depends on what exactly you mean by >> summary. If its like a window around the current word then you can >> possibly do it in one go. >> Amar >>> associated with every word. How should I design my program from this >> stage? 
>>> I mean how would I apply multiple mapreduce to this? What would be the >>> better way to perform this? >>> >>> Thanks, >>> >>> Regards, >>> - >>> >>> >>
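On the "how would I apply multiple map reduce jobs programmatically" question above, the usual answer with the old mapred API is simply to run the jobs back to back from a driver, pointing the second job's input at the first job's output directory. The skeleton below is a hedged sketch, not the thread's own code: IdentityMapper/IdentityReducer are stand-ins (you would plug in the MapClass/Reduce shown above for pass 1 and the still-to-be-written stop-word/scrub classes for pass 2, adjusting the output key/value classes to match), and the FileInputFormat/FileOutputFormat path helpers belong to slightly later releases; on older versions the equivalent setters lived on JobConf itself, if I remember right.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Driver that chains two MapReduce passes: pass 1 would build the raw inverted
// index, pass 2 would read pass 1's output and remove stop/scrub words.
public class IndexDriver {
    public static void main(String[] args) throws Exception {
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]); // output of pass 1, input of pass 2
        Path output = new Path(args[2]);

        JobConf pass1 = new JobConf(IndexDriver.class);
        pass1.setJobName("build-index");
        pass1.setMapperClass(IdentityMapper.class);    // substitute MapClass here
        pass1.setReducerClass(IdentityReducer.class);  // substitute Reduce here
        pass1.setOutputKeyClass(LongWritable.class);   // Text/Text with the real classes
        pass1.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(pass1, input);
        FileOutputFormat.setOutputPath(pass1, intermediate);
        JobClient.runJob(pass1);                       // blocks until pass 1 completes

        JobConf pass2 = new JobConf(IndexDriver.class);
        pass2.setJobName("scrub-index");
        pass2.setMapperClass(IdentityMapper.class);    // substitute the scrub mapper
        pass2.setReducerClass(IdentityReducer.class);  // substitute the scrub reducer
        pass2.setOutputKeyClass(LongWritable.class);
        pass2.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(pass2, intermediate);
        FileOutputFormat.setOutputPath(pass2, output);
        JobClient.runJob(pass2);                       // runs only if pass 1 succeeded
    }
}

Because JobClient.runJob() blocks and throws on failure, the second job only ever sees a complete first-pass output; more elaborate dependency handling can come later if the chain grows.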
Newbie and basic questions
We have been reading some docs, and playing with the basic samples that
come with Hadoop 0.16.2. So let's see if we have understood everything :)

We plan to use HCore for processing our logs, but would it be possible to
use it for a case like this one?

A MySQL table with a few thousand new rows every hour.
Every unique row must be sent by HTTP/GET to remote servers.
Update the row with the HTTP/GET result.
If the HTTP/GET fails, retry that row again later.

We can't send the same unique row twice, so we must be sure that each is
sent once and only once, well, at least until we get a valid response.

Right now we are doing it with a simple Java app that selects N rows from
the table and uses N threads to do N HTTP/GETs.

If it's possible to implement this in Hadoop, we get easy scale-out and
high availability.

Scaling up is becoming a bit of a problem, and we have no real HA.

I have doubts like how to "select unique rows" and give them to Hadoop,
etc...

Any advice is welcome :)
Re: Is it possible in Hadoop to overwrite or update a file?
Ted Dunning wrote:

> This factor of 1500 in speed seems pretty significant and is the motivation
> for not supporting random read/write.
>
> This doesn't mean that random access update should never be done, but it
> does mean that scaling a design based around random access will be more
> difficult than scaling a design based on sequential read and write.
>
> On 4/3/08 12:07 PM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:
>
>> In general, if updates are relatively frequent and small compared to the
>> size of data then this could be useful.

Hehe ... yes, good calculations :) What I had in mind though when saying
"relatively frequent" was rather a situation when updates are usually
small and come at unpredictable intervals (e.g. picked from a queue
listener) and then need to set flags on a few records. Running
sequential update in face of such minor changes doesn't usually pay off,
and queueing the changes so that it starts to pay off is sometimes not
possible (takes too long to fill the batch).

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
Re: distcp fails when copying from s3 to hdfs
Thanks for the quick response, Tom. I have just switched to Hadoop 0.16.2 and tried this again. Now I am getting the following error: Copy failed: org.apache.hadoop.mapred.InvalidInputException: Input source s3://id:[EMAIL PROTECTED]/file.txt does not exist. I copied the file to S3 using the following command: bin/hadoop distcp file.txt s3://id:[EMAIL PROTECTED]/file.txt To check that the file actually exists on S3, I tried the following commands: bin/hadoop fs -fs s3://id:[EMAIL PROTECTED] -ls bin/hadoop fs -fs s3://id:[EMAIL PROTECTED] -ls The first returned nothing, while the second returned the following: Found 1 items /_distcp_logs_5vzva5 1969-12-31 19:00rwxrwxrwx And when I tried to copy it back to hdfs using the following command: bin/hadoop distcp s3://id:[EMAIL PROTECTED]/file.txt file2.txt I got this error: Copy failed: org.apache.hadoop.mapred.InvalidInputException: Input source s3://id:[EMAIL PROTECTED]/file.txt does not exist. at org.apache.hadoop.util.CopyFiles.checkSrcPath(CopyFiles.java:504) at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:520) at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:596) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:612) Any pointers on why this could be happening? Thanks, Siddhartha On Fri, Apr 4, 2008 at 2:13 PM, Tom White <[EMAIL PROTECTED]> wrote: > Hi Siddhartha, > > This is a problem in 0.16.1 > (https://issues.apache.org/jira/browse/HADOOP-3027) that is fixed in > 0.16.2, which was released yesterday. > > Tom > > On 04/04/2008, Siddhartha Reddy <[EMAIL PROTECTED]> wrote: > > I am trying to run a Hadoop cluster on Amazon EC2 and backup all the > data on > > Amazon S3 between the runs. I am using Hadoop 0.16.1 on a cluster made > up of > > CentOS 5 images (ami-08f41161). 
> > > > > > I am able to copy from hdfs to S3 using the following command: > > > > bin/hadoop distcp file.txt s3://id:[EMAIL PROTECTED]/file.txt > > > > > > But copying from S3 to hdfs with the following command fails: > > > > bin/hadoop distcp s3://id:[EMAIL PROTECTED]/file.txt file2.txt > > > > > > with the following error: > > > > With failures, global counters are inaccurate; consider running with -i > > Copy failed: java.lang.IllegalArgumentException: Hook previously > registered > > at > > > java.lang.ApplicationShutdownHooks.add(ApplicationShutdownHooks.java:45) > > at java.lang.Runtime.addShutdownHook(Runtime.java:192) > > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1194) > > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:148) > > at > org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:81) > > at > > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1180) > > at org.apache.hadoop.fs.FileSystem.access$400(FileSystem.java:53) > > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1197) > > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:148) > > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175) > > at org.apache.hadoop.util.CopyFiles.checkSrcPath(CopyFiles.java:482) > > at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:504) > > at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:580) > > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) > > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) > > at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:596) > > > > > > Can someone please point out if and what I am doing wrong? > > > > Thanks, > > > > Siddhartha Reddy > > > -- http://sids.in "If you are not having fun, you are not doing it right."
Streaming + custom input format
Hi All,

I have a streaming tool chain written in C++/Python that performs some
operations on really big text files (on the order of gigabytes); the chain
reads files and writes its result to standard output.

The chain needs to read well-structured files, and so I need to control how
Hadoop splits files: it should split a file only at suitable places. What's
the best way to do that?

I'm trying to define a custom input format this way, but I'm not sure it's
right:

public class MyInputFormat extends FileInputFormat {
...

public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
...
}
}

That said, I tried to run it (on Hadoop 0.15.3, 0.16.0, 0.16.1) with:

$HADOOP_HOME/bin/hadoop jar
$HADOOP_HOME/contrib/streaming/hadoop-0.16.1-streaming.jar -file ./allTools.sh
-mapper "allTools.sh" -jobconf mapred.reduce.tasks=0 -file pathToMyClass.class
-inputformat MyClass -input test.txt -output test-output

But it raises the exception "-inputformat : class not found : MyClass".
I tried passing a jar instead of the class file, putting them in
HADOOP_CLASSPATH, and putting them in the system CLASSPATH, but always with
the same result.

Thank you for your patience!
-- Francesco
Re: distcp fails when copying from s3 to hdfs
Hi Siddhartha, This is a problem in 0.16.1 (https://issues.apache.org/jira/browse/HADOOP-3027) that is fixed in 0.16.2, which was released yesterday. Tom On 04/04/2008, Siddhartha Reddy <[EMAIL PROTECTED]> wrote: > I am trying to run a Hadoop cluster on Amazon EC2 and backup all the data on > Amazon S3 between the runs. I am using Hadoop 0.16.1 on a cluster made up of > CentOS 5 images (ami-08f41161). > > > I am able to copy from hdfs to S3 using the following command: > > bin/hadoop distcp file.txt s3://id:[EMAIL PROTECTED]/file.txt > > > But copying from S3 to hdfs with the following command fails: > > bin/hadoop distcp s3://id:[EMAIL PROTECTED]/file.txt file2.txt > > > with the following error: > > With failures, global counters are inaccurate; consider running with -i > Copy failed: java.lang.IllegalArgumentException: Hook previously registered > at > java.lang.ApplicationShutdownHooks.add(ApplicationShutdownHooks.java:45) > at java.lang.Runtime.addShutdownHook(Runtime.java:192) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1194) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:148) > at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:81) > at > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1180) > at org.apache.hadoop.fs.FileSystem.access$400(FileSystem.java:53) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1197) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:148) > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175) > at org.apache.hadoop.util.CopyFiles.checkSrcPath(CopyFiles.java:482) > at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:504) > at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:580) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) > at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:596) > > > Can someone please point out if and what I am doing wrong? > > Thanks, > > Siddhartha Reddy >
Re: distcp fails: Input source not found
> However, when I try it on 0.15.3, it doesn't allow a folder copy. I have
> 100+ files in my S3 bucket, and I had to run "distcp" on each one of them
> to get them on HDFS on EC2. Not a nice experience!

This sounds like a bug - could you log a Jira issue for this please?

Thanks,
Tom
distcp fails when copying from s3 to hdfs
I am trying to run a Hadoop cluster on Amazon EC2 and backup all the data on Amazon S3 between the runs. I am using Hadoop 0.16.1 on a cluster made up of CentOS 5 images (ami-08f41161). I am able to copy from hdfs to S3 using the following command: bin/hadoop distcp file.txt s3://id:[EMAIL PROTECTED]/file.txt But copying from S3 to hdfs with the following command fails: bin/hadoop distcp s3://id:[EMAIL PROTECTED]/file.txt file2.txt with the following error: With failures, global counters are inaccurate; consider running with -i Copy failed: java.lang.IllegalArgumentException: Hook previously registered at java.lang.ApplicationShutdownHooks.add(ApplicationShutdownHooks.java:45) at java.lang.Runtime.addShutdownHook(Runtime.java:192) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1194) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:148) at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:81) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1180) at org.apache.hadoop.fs.FileSystem.access$400(FileSystem.java:53) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1197) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:148) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175) at org.apache.hadoop.util.CopyFiles.checkSrcPath(CopyFiles.java:482) at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:504) at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:580) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:596) Can someone please point out if and what I am doing wrong? Thanks, Siddhartha Reddy