Namenode UI - Browse File System not working in pseudo-dist cluster
Hi,

The Browse File System link in the NameNode UI (http://namenode:50070) is not working when I run the NameNode and one DataNode on the same machine (pseudo-distributed mode). I thought it might be a Jetty issue. But if I run the NameNode and one DataNode on one machine and another DataNode on a second machine (one NN and two DNs in total), the Browse File System link works fine and I can see the files in HDFS.

Any idea why this fails in pseudo-distributed mode?

Thanks,
Gokul
Re: Questions about SequenceFiles
Yeah, no, I get that. But when you use the sequence file reader example from Hadoop: The Definitive Guide, page 106:

    reader = new SequenceFile.Reader(fs, path, conf);
    System.out.println(reader.getKeyClass());
    System.out.println(reader.getValueClass());
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable val = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    LuceneDocumentWrapper ldw = null;
    long position = reader.getPosition();
    while (reader.next(key, val)) {
        ldw = (LuceneDocumentWrapper) val;
        System.out.println(ldw.get());
    }

with a LuceneDocumentWrapper, which implements the interface, I get this error:

    java.lang.RuntimeException: java.lang.NoSuchMethodException: org.apache.hadoop.hbase.mapreduce.LuceneDocumentWrapper.<init>()
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
        at com.iswcorp.mapreduce.test.SequenceFileReaderTest.main(SequenceFileReaderTest.java:39)
    Caused by: java.lang.NoSuchMethodException: org.apache.hadoop.hbase.mapreduce.LuceneDocumentWrapper.<init>()
        at java.lang.Class.getConstructor0(Class.java:2706)

It is caused by this line:

    Writable val = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

which fails because the class has no default constructor - which is why I asked the original question. Is there some other way to get the values out?

Ananth T Sarathy

On Mon, May 10, 2010 at 11:46 PM, Ted Yu yuzhih...@gmail.com wrote:

> Writable is the recommended interface to work with. Writable implementations reuse instances, which serves large-scale data processing better than JavaSerialization.
>
> Cheers
>
> On Mon, May 10, 2010 at 6:29 PM, Ananth Sarathy ananth.t.sara...@gmail.com wrote:
>
>> My team and I were working with sequence files and were using the LuceneDocumentWrapper. But when I try to get the value, I get a NoSuchMethodException from ReflectionUtils, because it tries to call a default constructor which doesn't exist for that class. So my question is whether there is documentation on, or limitations to, the types of objects that can be used with a SequenceFile other than the Writable interface? I want to know if maybe I am trying to read from the file in the wrong way.
>>
>> Ananth T Sarathy
Re: Questions about SequenceFiles
The class implementing Writable should provide a public default constructor.

On Tue, May 11, 2010 at 7:20 AM, Ananth Sarathy ananth.t.sara...@gmail.com wrote:

> [...]
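For illustration, a minimal sketch of the kind of Writable Ted describes - the public no-arg constructor is what ReflectionUtils needs; the class name and field here are placeholders, not the real LuceneDocumentWrapper:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    public class MyDocumentWrapper implements Writable {
        private String payload = "";

        // Required: ReflectionUtils.newInstance() calls this no-arg
        // constructor, then fills the instance in via readFields().
        public MyDocumentWrapper() {
        }

        public MyDocumentWrapper(String payload) {
            this.payload = payload;
        }

        public void write(DataOutput out) throws IOException {
            out.writeUTF(payload);
        }

        public void readFields(DataInput in) throws IOException {
            payload = in.readUTF();
        }
    }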
Re: Questions about SequenceFiles
I think this is a bug; a Writable object should have a default no-argument constructor.

On Tue, May 11, 2010 at 7:20 AM, Ananth Sarathy ananth.t.sara...@gmail.com wrote:

> [...]

--
Best Regards

Jeff Zhang
Re: Hadoop performance - xfs and ext4
On 23/04/10 15:43, Todd Lipcon wrote:

> Hi Stephen,
>
> Can you try mounting ext4 with the nodelalloc option? I've seen the same improvement due to delayed allocation, but been a little nervous about that option (especially on the NN, where we currently follow what the kernel people call an antipattern for image rotation).

Hi Todd,

Sorry for the delayed response - I had to wait for another test window before trying this out.

To clarify, my namenode and secondary namenode have been using ext4 in all tests - reconfiguring the datanodes is a fast operation, the NN and 2NN less so. I figure any big performance benefit would appear on the datanodes anyway, and I can then apply it back to the NN and 2NN if testing shows any benefit in changing.

So I tried running our datanodes with their ext4 filesystems mounted using noatime,nodelalloc, and after 6 runs of the TeraSort it seems it runs SLOWER with those options, by between 5-8%. The TeraGen itself seemed to run about 5% faster, but it was only a single run so I'm not sure how reliable that is.

hth,

-stephen

--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie  http://webstar.deri.ie  http://sindice.com
Re: Hadoop performance - xfs and ext4
On Tue, May 11, 2010 at 7:33 AM, stephen mulcahy stephen.mulc...@deri.org wrote:

> [...]
>
> So I tried running our datanodes with their ext4 filesystems mounted using noatime,nodelalloc, and after 6 runs of the TeraSort it seems it runs SLOWER with those options, by between 5-8%. The TeraGen itself seemed to run about 5% faster, but it was only a single run so I'm not sure how reliable that is.

Yep, that's what I'd expect. noatime should be a small improvement, nodelalloc should be a small detriment. The thing is that delayed allocation has some strange cases that could theoretically cause data loss after a power outage, so I was interested to see if it nullified all of your performance gains or if it was just a small hit.

-Todd

--
Todd Lipcon
Software Engineer, Cloudera
Re: Questions about SequenceFiles
On Tue, May 11, 2010 at 7:48 AM, Ananth Sarathy ananth.t.sara...@gmail.com wrote:

> Ok, how can I report that?

File a jira on the project that manages the type. I assume it is Lucene in this case.

> Also, it seems that requiring a no-argument constructor but using an interface is kind of a broken paradigm. Shouldn't there be some other mechanism for this?

The problem is that, given a class name from the SequenceFile, we need to build an empty object. The most natural way to provide that capability is with a 0-argument constructor.

-- Owen
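A hedged illustration of Owen's point (not the actual ReflectionUtils source): the reader knows the value type only as a class name from the file header, so it has to build an empty instance reflectively and then populate it from the stream - which is exactly what forces the 0-argument constructor:

    import java.io.DataInput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    public class EmptyObjectSketch {
        // Roughly what the reader does: instantiate by name, then fill
        // in the fields from the stream via readFields().
        public static Writable buildValue(String valueClassName, DataInput in)
                throws IOException, ClassNotFoundException,
                       InstantiationException, IllegalAccessException {
            Class<?> valueClass = Class.forName(valueClassName);
            Writable val = (Writable) valueClass.newInstance(); // fails without a no-arg constructor
            val.readFields(in); // populate the empty object
            return val;
        }
    }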
Re: Hadoop performance - xfs and ext4
On Tue, May 11, 2010 at 10:39 AM, Todd Lipcon t...@cloudera.com wrote:

> [...]

For most people, tuning the disk configuration for the NameNode is wasted time. Why? The current capacity of our hadoop cluster is:

    Present Capacity: 48799678056 (101.09 TB)

Yet the NameNode data itself is tiny:

    du -hs /usr/local/hadoop_root/hdfs_master
    684M    /usr/local/hadoop_root/hdfs_master

Likely the entire node table fits entirely inside the VFS cache; performance is not usually an issue, reliability is. The more exotic you get with this mount (EXT5, rarely used mount options), the less reliable it is going to be (IMHO). This is because your configuration space is not shared by that many people.

DataNodes are a different story. These are worth tuning. I suggest configuring a single datanode differently (say ext4 with fancy options x, y, z), waiting a while to get real production load at it, then looking at some performance data to see if that node shows any tangible difference. Do not look for low-level things like "bonnie says delete rate is +5% but create rate is -5%". Look at the big picture: if you can't see a tangible big-picture difference like "map jobs seem to finish 5% faster on this node", what are you doing the tuning for? :)

I know this seems like a rather unscientific approach, but disk tuning/performance measurement is very complex because application, VFS cache, and available memory are the critical factors in performance.
Namenode warnings
Hi,

I saw a lot of warnings like the following in the namenode log:

    2010-05-11 06:45:07,186 WARN /: /listPaths/s: java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.namenode.ListPathsServlet.doGet(ListPathsServlet.java:153)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:596)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
        at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
        at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
        at org.mortbay.http.HttpServer.service(HttpServer.java:954)
        at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
        at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
        at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
        at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
        at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
        at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)

I am using Hadoop 0.19. Anybody know what might be the problem?

Thanks,
Runping
Re: Hadoop performance - xfs and ext4
Did you try the XFS 'allocsize' mount parameter (for example, allocsize=8m)? It will reduce fragmentation during concurrent writes. It's more complicated, but using separate partitions for temp space versus HDFS also has an effect; XFS isn't as good with the temp space.

In short, a single test with default configurations is useful, but doesn't complete the picture. Both file systems have several important tuning knobs.

On Apr 22, 2010, at 1:02 AM, stephen mulcahy wrote:

> Hi,
>
> I've been tweaking our cluster roll-out process to refine it. While doing so, I decided to check whether XFS gives any performance benefit over EXT4.
>
> As per a comment I read somewhere on the hbase wiki, XFS makes for faster formatting of filesystems: it takes us 5.5 minutes to rebuild a datanode from bare metal to a full Hadoop config on top of Debian Squeeze using XFS, versus 9 minutes for the same bare-metal restore with EXT4.
>
> However, TeraSort performance on a cluster of 45 of these datanodes shows XFS is slower (same configuration settings on both installs other than the changed filesystem). Specifically:
>
>     mkfs.xfs -f -l size=64m DEV (mounted with noatime,nodiratime,logbufs=8)
>
> gives me a cluster which runs TeraSort in about 23 minutes, while
>
>     mkfs.ext4 -T largefile4 DEV (mounted with noatime)
>
> gives me a cluster which runs TeraSort in about 18.5 minutes.
>
> So I'll be rolling our cluster back to EXT4, but thought the information might be useful/interesting to others.
>
> -stephen
>
> XFS config chosen from notes at http://everything2.com/index.pl?node_id=1479435
>
> --
> Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
> NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
> http://di2.deri.ie  http://webstar.deri.ie  http://sindice.com
Re: Hadoop performance - xfs and ext4
Ah, one more thing. With XFS there is an online defragmenter - it runs every night on my cluster. Performance on a fresh, empty system will not match a used one that has become fragmented.

On Apr 22, 2010, at 1:02 AM, stephen mulcahy wrote:

> [...]
Re: Namenode warnings
On May 11, 2010, at 9:53 AM, Runping Qi wrote:

> I am using Hadoop 0.19. Anybody know what might be the problem?

I think you answered your own question. :)
Re: Namenode warnings
So it's a known problem of Hadoop 0.19?

On Tue, May 11, 2010 at 11:06 AM, Allen Wittenauer awittena...@linkedin.com wrote:

> On May 11, 2010, at 9:53 AM, Runping Qi wrote:
>
> > I am using Hadoop 0.19. Anybody know what might be the problem?
>
> I think you answered your own question. :)
Re: job executions fail with NotReplicatedYetException
For anyone else out there seeing this problem: this was alleviated for me by increasing dfs.namenode.handler.count and dfs.datanode.handler.count.

/ Oscar

On Mon, May 10, 2010 at 11:23 AM, Oscar Gothberg oscar.gothb...@gmail.com wrote:

> Hi,
>
> I keep having jobs fail at the very end, with the map 100% complete and the reduce 100% complete, due to a NotReplicatedYetException w.r.t. the _temporary subdirectory of the job output directory.
>
> It doesn't happen 100% of the time, so it's not trivially reproducible, but it happens enough (10-20% of runs) to make it a real pain. Any ideas - has anyone seen something similar?
>
> Part of the stack trace:
>
>     NotReplicatedYetException: Not replicated yet:/test/out/dayperiod=14731/_temporary/_attempt_201005052338_0194_r_01_0/part-1
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1253)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
>         at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
>     ...
>
> Thanks,
> / Oscar
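For reference, a sketch of how those two settings might look in hdfs-site.xml - the property names are the ones Oscar mentions, but the values below are illustrative, not recommendations from this thread:

    <!-- Illustrative values only; tune for your cluster size. -->
    <property>
      <name>dfs.namenode.handler.count</name>
      <value>64</value>
    </property>
    <property>
      <name>dfs.datanode.handler.count</name>
      <value>10</value>
    </property>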
Hadoop Training @ Hadoop Summit - Early Bird Discount Expires Soon!
Hadoop Fans,

Just a quick note about training options at the Hadoop Summit. There are discounts expiring soon, so if you planned to attend, or didn't know, we want to make sure you stay in the loop.

We're offering certification courses for developers and admins, as well as an introduction to Hadoop. We'll also debut courses on Hive and HBase, because you asked for them. You get the cost of your Summit registration ($100) off any of these courses just by using the discount code included with your Summit registration email confirmation, but if you register 45 days in advance, you save another $100 per day (and Monday's courses are just 47 days out now!).

Intro to Hadoop (Monday): http://www.eventbrite.com/event/621620283/apache0511
Cloudera Desktop SDK (Monday): http://www.eventbrite.com/event/621677454/apache0511
Hadoop for Developers + Certification (Wednesday-Thursday): http://www.eventbrite.com/event/621640343/apache0511
Hadoop for Administrators + Certification (Wednesday-Thursday): http://www.eventbrite.com/event/621643352/apache0511
Hive (Friday): http://www.eventbrite.com/event/621672439/apache0511
HBase (Friday): http://www.eventbrite.com/event/621670433/apache0511

You can see an overview here: http://www.cloudera.com/hadoop-training/hadoop-summit-2010/

Cheers,
Christophe

--
get hadoop: cloudera.com/hadoop
online training: cloudera.com/hadoop-training
blog: cloudera.com/blog
twitter: twitter.com/cloudera
Re: Import the results into SimpleDB
Hi Mark,

It would be better to create an OutputFormat instead of connecting directly from the mapper. The OutputFormat will be called regardless of the existence of reducers. Make sure to set setNumReduceTasks(0) on the job. (I'm not sure setting the reducer class to null would work.)

Nick
Sent by radiation.

----- Original Message -----
From: Mark Kerzner markkerz...@gmail.com
To: core-u...@hadoop.apache.org
Sent: Tue May 11 21:02:05 2010
Subject: Import the results into SimpleDB

Hi,

I want a Hadoop job that will simply take each line of the input text file and store it (after parsing) in a database, like SimpleDB. Can I put this code into the Mapper, make no call to collect in it, and have no reducers at all? Do I set the reduce class to null - conf.setReducerClass(null) - or not set it at all?

Thank you,
Mark
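A minimal sketch of the map-only setup Nick describes, using the old (mapred) API; the job, mapper, and output format class names are placeholders:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ImportJob {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(ImportJob.class);
            conf.setMapperClass(LineParsingMapper.class);  // placeholder mapper
            conf.setNumReduceTasks(0);                     // map-only: no reducer at all
            conf.setOutputFormat(MyDbOutputFormat.class);  // placeholder DB-backed OutputFormat
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            JobClient.runJob(conf);
        }
    }

With zero reduce tasks, whatever the mapper emits goes straight to the OutputFormat, so no reducer class needs to be set (or nulled) at all.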
Re: Namenode warnings
Hi Runping,

This is a known issue. See https://issues.apache.org/jira/browse/HDFS-625.

Nicholas Sze

----- Original Message -----
From: Runping Qi runping...@gmail.com
To: common-user@hadoop.apache.org
Sent: Wed, May 12, 2010 12:53:13 AM
Subject: Namenode warnings

[...]
Re: Import the results into SimpleDB
Might as well not use Hadoop then...

On Tue, 2010-05-11 at 21:02 -0500, Mark Kerzner wrote:

> [...]
Re: Import the results into SimpleDB
Hi Nick,

Should I then provide the RecordWriter implementation in the OutputFormat, which will connect to the database and write each record to it instead of to HDFS?

Thank you,
Mark

On Tue, May 11, 2010 at 9:08 PM, Jones, Nick nick.jo...@amd.com wrote:

> [...]
Re: Import the results into SimpleDB
:) I create this text file in Hadoop. Only I want to make the DB import a separate Hadoop job, run it in Amazon EMR, and make it fast by running a sufficient number of nodes.

Mark

On Tue, May 11, 2010 at 9:13 PM, Darren Govoni dar...@ontrenet.com wrote:

> Might as well not use Hadoop then...
>
> [...]
Re: Import the results into SimpleDB
Mark,

You can do it either way. Create the connection object for the database in the configure() or setup() method of the mapper (depending on which API you are using) and insert the record from the mapper function. You don't have to have a reducer.

If you create an output format, the mapper can directly write to it. In essence you'll be doing the same thing. It's easier to create an output format if you'll be writing more of such code.

-Amandeep

Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz

On Tue, May 11, 2010 at 7:15 PM, Mark Kerzner markkerz...@gmail.com wrote:

> [...]
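In case it helps, a hedged sketch of such a DB-backed OutputFormat in the old (mapred) API; SimpleDbClient and its put()/close() methods are hypothetical stand-ins for whatever SimpleDB client library you use:

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.OutputFormat;
    import org.apache.hadoop.mapred.RecordWriter;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.util.Progressable;

    public class SimpleDbOutputFormat implements OutputFormat<Text, Text> {

        public RecordWriter<Text, Text> getRecordWriter(
                FileSystem ignored, JobConf job, String name, Progressable progress)
                throws IOException {
            // Hypothetical client: one connection per task attempt.
            final SimpleDbClient client = new SimpleDbClient(job.get("simpledb.domain"));
            return new RecordWriter<Text, Text>() {
                public void write(Text key, Text value) throws IOException {
                    client.put(key.toString(), value.toString()); // hypothetical call
                }
                public void close(Reporter reporter) throws IOException {
                    client.close();
                }
            };
        }

        public void checkOutputSpecs(FileSystem ignored, JobConf job) throws IOException {
            if (job.get("simpledb.domain") == null) {
                throw new IOException("simpledb.domain is not set");
            }
        }
    }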
Re: Import the results into SimpleDB
> Might as well not use Hadoop then...

Hadoop makes it easy to parallelize the work... makes perfect sense to use it!
Re: Import the results into SimpleDB
Hi Mark,

I haven't actually written one myself, but take a look at DBOutputFormat as an example. If SimpleDB has a JDBC connector, it might work as-is.

Nick
Sent by radiation.

----- Original Message -----
From: Mark Kerzner markkerz...@gmail.com
To: common-user@hadoop.apache.org
Sent: Tue May 11 21:15:43 2010
Subject: Re: Import the results into SimpleDB

[...]
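If DBOutputFormat does fit (i.e. the store is reachable over JDBC), the wiring looks roughly like this - a sketch only, with placeholder driver class, URL, table, and column names:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.db.DBConfiguration;
    import org.apache.hadoop.mapred.lib.db.DBOutputFormat;

    public class DbOutputSetup {
        public static void configure(JobConf conf) {
            // Placeholder driver/URL: SimpleDB itself has no standard JDBC
            // driver, so this assumes some JDBC-accessible store.
            DBConfiguration.configureDB(conf,
                "com.example.jdbc.Driver", "jdbc:example://host/db");
            // Write to table "records" with two columns (placeholders).
            DBOutputFormat.setOutput(conf, "records", "id", "body");
        }
    }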
Context needed by mapper
Hi,

I am very new to the MapReduce paradigm, so this could be a dumb question. What do you do if your mapper functions need to know more than just the data being processed in order to do their job?

The simplest example I can think of is implementing a selective, phrase-based version of wordcount. Imagine you want to count the occurrences of all notable names (from the notable names database) in a large collection of news stories. You can't just count phrases - the number of potential word combinations is ridiculously large, and the vast majority are irrelevant. You have a limited (large, but bounded) vocabulary of phrases you are interested in - this list of names - and you want each mapper to be aware of it and only count the relevant phrases.

You basically want to give each mapper read-only access to a HashSet of phrases as well as the documents they should be counting over. How would you do that?

Cheers,
Dave
Re: Context needed by mapper
Hi,

To count phrases, you can choose not to split the file, by writing your own InputFormat which extends org.apache.hadoop.mapred.TextInputFormat and overrides isSplitable to return false. Also, you have to provide your own RecordReader which can read phrases from the given text file. Take a look at http://hadoop.apache.org/common/docs/r0.18.3/api/org/apache/hadoop/mapred/RecordReader.html.

Thanks,
Prashant.

On Wed, May 12, 2010 at 11:03 AM, DNMILNE d.n.mi...@gmail.com wrote:

> [...]

--
Thanks and Regards,
Prashant Ullegaddi,
Search and Information Extraction Lab,
IIIT-Hyderabad, India.
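A minimal sketch of the InputFormat Prashant describes, assuming the old (mapred) API from the linked docs; the class name is a placeholder, and a phrase-aware RecordReader would still have to be written separately:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Each input file goes to a single mapper instead of being
    // split at block boundaries.
    public class NonSplittingTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(FileSystem fs, Path file) {
            return false;
        }
    }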