Re: Passing Command-line Parameters to the Job Submit Command
You could always write your own properties file and read it as a resource.

On Tue, Sep 25, 2012 at 12:10 AM, Hemanth Yamijala yhema...@gmail.com wrote: By Java environment variables, do you mean the ones passed as -Dkey=value? That's one way of passing them. I suppose another way is to have a client-side site configuration (like mapred-site.xml) that is in the classpath of the client app. Thanks, Hemanth

On Tue, Sep 25, 2012 at 12:20 AM, Varad Meru meru.va...@gmail.com wrote: Thanks Hemanth. But in general, if we want to pass arguments to any job (not only PiEstimator from the examples jar) and submit the job to the job queue scheduler, by the looks of it, we might always need to use the Java environment variables only. Is my above assumption correct? Thanks, Varad

On Mon, Sep 24, 2012 at 9:48 AM, Hemanth Yamijala yhema...@gmail.com wrote: Varad, Looking at the code for the PiEstimator class which implements the 'pi' example, the two arguments are mandatory and are used *before* the job is submitted for execution - i.e. on the client side. In particular, one of them (nSamples) is used not by the MapReduce job, but by the client code (i.e. PiEstimator) to generate some input. Hence, I believe all of this additional work that is being done by the PiEstimator class will be bypassed if we directly use the job -submit command. In other words, I don't think these two ways of running the job are equivalent: using hadoop jar examples pi, and using hadoop job -submit. As a general answer to your question though, if additional parameters are used by the mappers or reducers, then they will generally be set as additional job-specific configuration items. So, one way of using them with the job -submit command will be to find out the specific names of the configuration items (from code, or some other documentation), and include them in the job.xml used when submitting the job. Thanks, Hemanth

On Sun, Sep 23, 2012 at 1:24 PM, Varad Meru meru.va...@gmail.com wrote: Hi, I want to run the PiEstimator example using the following command:

$ hadoop job -submit pieestimatorconf.xml

which contains all the info required by Hadoop to run the job, e.g. the input file location, the output file location and other details:

<property><name>mapred.jar</name><value>file:/Users/varadmeru/Work/Hadoop/hadoop-examples-1.0.3.jar</value></property>
<property><name>mapred.map.tasks</name><value>20</value></property>
<property><name>mapred.reduce.tasks</name><value>2</value></property>
...
<property><name>mapred.job.name</name><value>PiEstimator</value></property>
<property><name>mapred.output.dir</name><value>file:/Users/varadmeru/Work/out</value></property>

Now, as we know, to run the PiEstimator we can also use the following command:

$ hadoop jar hadoop-examples-1.0.3.jar pi 5 10

where 5 and 10 are the arguments to the main class of the PiEstimator. How can I pass the same arguments (5 and 10) using the job -submit command, through the conf file or any other way, without changing the code of the examples to reflect the use of environment variables? Thanks in advance, Varad

- Varad Meru Software Engineer, Business Intelligence and Analytics, Persistent Systems and Solutions Ltd., Pune, India.
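[Editor's note] As a concrete illustration of Hemanth's point about job-specific configuration items, here is a minimal old-API sketch. The property name pi.nsamples is hypothetical and used only for illustration - the real PiEstimator consumes its arguments on the client side, so this would require changing the example's code:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class SampleMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {
      private long nSamples;

      @Override
      public void configure(JobConf job) {
        // read the job-specific configuration item; set it either in the
        // job.xml passed to "hadoop job -submit" or via -Dpi.nsamples=...
        nSamples = job.getLong("pi.nsamples", 10L);
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        output.collect(new Text("nSamples"), new LongWritable(nSamples));
      }
    }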
Re: Number of Maps running more than expected
It would be helpful to see some statistics out of both the jobs, like bytes read/written, number of errors, etc.

On Thu, Aug 16, 2012 at 8:02 PM, Raj Vishwanathan rajv...@yahoo.com wrote: You probably have speculative execution on. Extra map and reduce tasks are run in case some of them fail. Raj. Sent from my iPad. Please excuse the typos.

On Aug 16, 2012, at 11:36 AM, in.abdul in.ab...@gmail.com wrote: Hi Gaurav, the number of maps does not depend on the number of blocks. It really depends on the number of input splits. If you had 100 GB of data and you had 10 splits, then you would see only 10 maps. Please correct me if I am wrong. Thanks and regards, Syed Abdul Kather

On Aug 16, 2012 7:44 PM, Gaurav Dasgupta [via Lucene] wrote: Hi users, I am working on a CDH3 cluster of 12 nodes (TaskTrackers running on all 12 nodes and 1 node running the JobTracker). In order to perform a WordCount benchmark test, I did the following: - Executed RandomTextWriter first to create 100 GB of data (note that I have changed the test.randomtextwrite.total_bytes parameter only; the rest are kept default). - Next, executed the WordCount program on that 100 GB dataset. The block size in hdfs-site.xml is set to 128 MB. Now, according to my calculation, the total number of maps to be executed by the WordCount job should be 100 GB / 128 MB, or 102400 MB / 128 MB = 800. But when I execute the job, it runs a total of 900 maps, i.e., 100 extra. So, why this extra number of maps? Although, my job is completing successfully without any error. Again, if I don't execute the RandomTextWriter job to create data for my WordCount, but rather put my own 100 GB text file in HDFS and run WordCount, I can then see that the number of maps is equivalent to my calculation, i.e., 800. Can anyone tell me why this odd behaviour of Hadoop regarding the number of maps for WordCount only when the dataset is generated by RandomTextWriter? And what is the purpose of these extra maps? Regards, Gaurav Dasgupta
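[Editor's note] For reference, a sketch of how the old-API FileInputFormat arrives at the split size in 1.x (paraphrased from the sources, not verbatim). Because splits never cross file boundaries, and RandomTextWriter emits one output file per map task, per-file remainders are one plausible source of maps beyond a simple totalSize / blockSize estimate:

    // inputs to the computation (values come from the job's configuration)
    long totalSize = 0L;          // sum of input file lengths
    int  numSplits = 1;           // hint from mapred.map.tasks
    long minSize   = 1L;          // mapred.min.split.size
    long blockSize = 134217728L;  // per-file HDFS block size (128 MB here)

    long goalSize  = totalSize / Math.max(1, numSplits);
    long splitSize = Math.max(minSize, Math.min(goalSize, blockSize));
    // Each input file is carved into splits of splitSize independently,
    // so every file contributes at least one map task.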
Re: Basic Question
On Tue, Aug 7, 2012 at 11:33 AM, Harsh J ha...@cloudera.com wrote: Each write call registers (writes) a KV pair to the output. The output collector does not look for similarities, nor does it try to de-dupe them, and even if the object is the same, its value is copied, so that doesn't matter. So you will get two KV pairs in your output - duplication is allowed and is normal in several MR cases. Think of WordCount, where a map() call may emit lots of ("is", 1) pairs if there are multiple "is" tokens in the line it processes, and can use set() calls to its benefit to avoid too much object creation.

Thanks!

On Tue, Aug 7, 2012 at 11:56 PM, Mohit Anchlia mohitanch...@gmail.com wrote: In a mapper I often use a global Text object, and throughout the map processing I just call set() on it. My question is: what happens if the collector receives a similar byte-array value? Does the last one overwrite the value in the collector? So if I did:

Text zip = new Text();
zip.set("9099");
collector.write(zip, value);
zip.set("9099");
collector.write(zip, value1);

Should I expect to receive both values in the reducer, or just one?

-- Harsh J
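[Editor's note] A short sketch of the WordCount-style reuse Harsh describes (old API; names are illustrative):

    private final Text word = new Text();
    private static final IntWritable ONE = new IntWritable(1);

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      for (String tok : value.toString().split("\\s+")) {
        word.set(tok);              // reuse the same Text object
        output.collect(word, ONE);  // each collect() serializes a copy,
      }                             // so later set() calls can't clobber it
    }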
Setting Configuration for local file:///
I am trying to write a test on the local file system, but this test keeps picking up the XML files on the classpath even though I am setting a different Configuration object. Is there a way for me to override it? I thought the way I am doing it overrides the configuration, but it doesn't seem to be working:

@Test
public void testOnLocalFS() throws Exception {
  Configuration conf = new Configuration();
  conf.set("fs.default.name", "file:///");
  conf.set("mapred.job.tracker", "local");
  Path input = new Path("geoinput/geo.dat");
  Path output = new Path("geooutput/");
  FileSystem fs = FileSystem.getLocal(conf);
  fs.delete(output, true);
  log.info("Here");
  GeoLookupConfigRunner configRunner = new GeoLookupConfigRunner();
  configRunner.setConf(conf);
  int exitCode = configRunner.run(new String[]{input.toString(), output.toString()});
  Assert.assertEquals(exitCode, 0);
}
Re: Setting Configuration for local file:///
On Tue, Aug 7, 2012 at 12:50 PM, Harsh J ha...@cloudera.com wrote: What is GeoLookupConfigRunner and how do you utilize the setConf(conf) object within it?

Thanks for the pointer - I wasn't setting my JobConf object with the conf that I passed. Just one more related question: if I use JobConf conf = new JobConf(getConf()) and I don't pass in any configuration, is the data from the XML files on the classpath used? I want this to work for all the scenarios.

On Wed, Aug 8, 2012 at 1:10 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying to write a test on the local file system, but this test keeps picking up the XML files on the classpath even though I am setting a different Configuration object. Is there a way for me to override it? I thought the way I am doing it overrides the configuration, but it doesn't seem to be working:

@Test
public void testOnLocalFS() throws Exception {
  Configuration conf = new Configuration();
  conf.set("fs.default.name", "file:///");
  conf.set("mapred.job.tracker", "local");
  Path input = new Path("geoinput/geo.dat");
  Path output = new Path("geooutput/");
  FileSystem fs = FileSystem.getLocal(conf);
  fs.delete(output, true);
  log.info("Here");
  GeoLookupConfigRunner configRunner = new GeoLookupConfigRunner();
  configRunner.setConf(conf);
  int exitCode = configRunner.run(new String[]{input.toString(), output.toString()});
  Assert.assertEquals(exitCode, 0);
}

-- Harsh J
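[Editor's note] A hedged skeleton of what the fix looks like: the runner builds its JobConf from getConf(), so the Configuration set on it in the test is actually honored. GeoLookupConfigRunner is the poster's class; this body is illustrative, not the poster's actual code:

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.Tool;

    public class GeoLookupConfigRunner extends Configured implements Tool {
      public int run(String[] args) throws Exception {
        // new JobConf(getConf(), ...) carries the test's Configuration into
        // the job; a bare new JobConf() would reload the classpath *-site.xml
        JobConf job = new JobConf(getConf(), GeoLookupConfigRunner.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // ... set mapper/reducer classes here ...
        JobClient.runJob(job);
        return 0;
      }
    }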
Local jobtracker in test env?
I just wrote a test where fs.default.name is file:/// and mapred.job.tracker is set to local. The test ran fine; I also see the mapper and reducer were invoked. But what I am trying to understand is how this ran without specifying the jobtracker port, and on which port the tasktracker connected to the jobtracker. It's not clear from the output. Also, what's the difference between this and bringing up a MiniDFS cluster?

INFO org.apache.hadoop.mapred.FileInputFormat [main]: Total input paths to process : 1
INFO org.apache.hadoop.mapred.JobClient [main]: Running job: job_local_0001
INFO org.apache.hadoop.mapred.Task [Thread-11]: Using ResourceCalculatorPlugin : null
INFO org.apache.hadoop.mapred.MapTask [Thread-11]: numReduceTasks: 1
INFO org.apache.hadoop.mapred.MapTask [Thread-11]: io.sort.mb = 100
INFO org.apache.hadoop.mapred.MapTask [Thread-11]: data buffer = 79691776/99614720
INFO org.apache.hadoop.mapred.MapTask [Thread-11]: record buffer = 262144/327680
INFO com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [Thread-11]: zip 92127
INFO com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [Thread-11]: zip 1
INFO com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [Thread-11]: zip 92127
INFO com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [Thread-11]: zip 1
INFO org.apache.hadoop.mapred.MapTask [Thread-11]: Starting flush of map output
INFO org.apache.hadoop.mapred.MapTask [Thread-11]: Finished spill 0
INFO org.apache.hadoop.mapred.Task [Thread-11]: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
INFO org.apache.hadoop.mapred.LocalJobRunner [Thread-11]: file:/c:/upb/dp/manchlia-dp/depot/services/data-platform/trunk/analytics/geoinput/geo.dat:0+18
INFO org.apache.hadoop.mapred.Task [Thread-11]: Task 'attempt_local_0001_m_000000_0' done.
INFO org.apache.hadoop.mapred.Task [Thread-11]: Using ResourceCalculatorPlugin : null
INFO org.apache.hadoop.mapred.LocalJobRunner [Thread-11]:
INFO org.apache.hadoop.mapred.Merger [Thread-11]: Merging 1 sorted segments
INFO org.apache.hadoop.mapred.Merger [Thread-11]: Down to the last merge-pass, with 1 segments left of total size: 26 bytes
INFO org.apache.hadoop.mapred.LocalJobRunner [Thread-11]:
INFO com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [Thread-11]: Inside reduce
INFO com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [Thread-11]: Outside reduce
INFO org.apache.hadoop.mapred.Task [Thread-11]: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
INFO org.apache.hadoop.mapred.LocalJobRunner [Thread-11]:
INFO org.apache.hadoop.mapred.Task [Thread-11]: Task attempt_local_0001_r_000000_0 is allowed to commit now
INFO org.apache.hadoop.mapred.FileOutputCommitter [Thread-11]: Saved output of task 'attempt_local_0001_r_000000_0' to file:/c:/upb/dp/manchlia-dp/depot/services/data-platform/trunk/analytics/geooutput
INFO org.apache.hadoop.mapred.LocalJobRunner [Thread-11]: reduce > reduce
INFO org.apache.hadoop.mapred.Task [Thread-11]: Task 'attempt_local_0001_r_000000_0' done.
INFO org.apache.hadoop.mapred.JobClient [main]: map 100% reduce 100%
INFO org.apache.hadoop.mapred.JobClient [main]: Job complete: job_local_0001
INFO org.apache.hadoop.mapred.JobClient [main]: Counters: 15
INFO org.apache.hadoop.mapred.JobClient [main]: FileSystemCounters
INFO org.apache.hadoop.mapred.JobClient [main]: FILE_BYTES_READ=458
INFO org.apache.hadoop.mapred.JobClient [main]: FILE_BYTES_WRITTEN=96110
INFO org.apache.hadoop.mapred.JobClient [main]: Map-Reduce Framework
INFO org.apache.hadoop.mapred.JobClient [main]: Map input records=2
INFO org.apache.hadoop.mapred.JobClient [main]: Reduce shuffle bytes=0
INFO org.apache.hadoop.mapred.JobClient [main]: Spilled Records=4
INFO org.apache.hadoop.mapred.JobClient [main]: Map output bytes=20
INFO org.apache.hadoop.mapred.JobClient [main]: Total committed heap usage (bytes)=321527808
INFO org.apache.hadoop.mapred.JobClient [main]: Map input bytes=18
INFO org.apache.hadoop.mapred.JobClient [main]: SPLIT_RAW_BYTES=142
INFO org.apache.hadoop.mapred.JobClient [main]: Combine input records=0
INFO org.apache.hadoop.mapred.JobClient [main]: Reduce input records=2
INFO org.apache.hadoop.mapred.JobClient [main]: Reduce input groups=1
INFO org.apache.hadoop.mapred.JobClient [main]: Combine output records=0
INFO org.apache.hadoop.mapred.JobClient [main]: Reduce output records=1
INFO org.apache.hadoop.mapred.JobClient [main]: Map output records=2
INFO com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [main]: Inside reduce
INFO com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [main]: Outside reduce
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.547 sec
Results :
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0
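[Editor's note] The short answer implied by the log: mapred.job.tracker=local selects LocalJobRunner, which executes the whole job in-process with no daemons and no ports - hence no jobtracker port appears anywhere. A MiniDFSCluster, by contrast, really starts a NameNode and DataNodes inside the test JVM. A hedged sketch of the latter, using the 1.x constructor:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.MiniDFSCluster;

    Configuration conf = new Configuration();
    MiniDFSCluster cluster = new MiniDFSCluster(conf, 1, true, null); // 1 DN, format storage
    try {
      FileSystem fs = cluster.getFileSystem(); // a real (mini) HDFS with live RPC ports
      // ... point the job at fs.getUri() to exercise actual NN/DN code paths ...
    } finally {
      cluster.shutdown();
    }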
Re: Avro
On Sat, Aug 4, 2012 at 11:43 PM, Nitin Kesarwani bumble@gmail.com wrote: Mohit, You can use this patch to suit your need: https://issues.apache.org/jira/browse/PIG-2579 New fields in the Avro schema descriptor file need to have a non-null default value. Hence, using the new schema file, you should be able to read older data as well. Try it out. It is very straightforward. Hope this helps!

Thanks! I am new to Avro - what's the best place to see some examples of how Avro deals with schema changes? I am trying to find some examples.

On Sun, Aug 5, 2012 at 12:01 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I've heard that Avro provides a good way of dealing with changing schemas. I am not sure how it could be done without keeping some kind of structure along with the data. Are there any good examples and documentation that I can look at?

-N
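[Editor's note] To make Nitin's point concrete, a hedged sketch of an evolved Avro record schema (the record and field names are hypothetical): the new field carries a default, so a reader using this schema can still decode records written before field4 existed:

    {
      "type": "record",
      "name": "ClickRecord",
      "fields": [
        {"name": "field1", "type": "string"},
        {"name": "field2", "type": "string"},
        {"name": "field3", "type": "string"},
        {"name": "field4", "type": "string", "default": ""}
      ]
    }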
Compression and Decompression
Is the compression done on the client side or on the server side? If I run 'hadoop fs -text', is the client decompressing the file for me?
Dealing with changing file format
I am wondering what the right way is to go about designing the reading of input and output where the file format may change over time. For instance, we might start with field1,field2,field3 but at some point add a new field4 to the input. What's the best way to deal with such scenarios? Keep a timestamped catalog of changes?
Re: Sync and Data Replication
On Sun, Jun 10, 2012 at 9:39 AM, Harsh J ha...@cloudera.com wrote: Mohit,

On Sat, Jun 9, 2012 at 11:11 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Thanks Harsh for the detailed info. It clears things up. The only concerning thing from those pages is what happens when the client crashes. It says you could lose up to a block's worth of information. Is this still true given that the NN would auto-close the file?

Where does it say this exactly? It is true that immediate readers will not get the last block (as it remains open and uncommitted), but once the lease recovery kicks in, the file is closed successfully and the last block is indeed made available, so there's no 'data loss'.

I saw it in the "Coherency Model - consequences for application design" paragraph. Thanks for the information. It at least helps me in that I don't have to worry about data loss when the file is not cleanly closed.

Is it a good practice to reduce the NN default value so that it auto-closes before 1 hour?

I've not seen people do this or need to do this. Most don't run into such a situation, and it is vital to properly close() files or sync() on file streams before making them available to readers. HBase manages open files during WAL recovery using lightweight recoverLease APIs that were added for its benefit, so it doesn't need to wait for an hour for WALs to close and recover data.

-- Harsh J
Re: Sync and Data Replication
Thanks Harsh for the detailed info. It clears things up. The only concerning thing from those pages is what happens when the client crashes. It says you could lose up to a block's worth of information. Is this still true given that the NN would auto-close the file? Is it a good practice to reduce the NN default value so that it auto-closes before 1 hour? Regarding the OS cache, I think it should be OK, since the chance of losing all replica nodes at the same time is low.

On Sat, Jun 9, 2012 at 5:13 AM, Harsh J ha...@cloudera.com wrote: Hi Mohit,

In this scenario, is the data also replicated, as defined by the replication factor, to other nodes? I am wondering: if a crash occurs at this point, do I have the data on other nodes?

What kind of crash are you talking about here? A client crash or a cluster crash? If a cluster, is the loss you're thinking of one DN or all the replicating DNs? If a client fails to close a file due to a crash, it is auto-closed later (default is one hour) by the NameNode, and whatever the client successfully wrote (i.e. into its last block) is then made available to readers at that point. If the client synced, then its last sync point is always available to readers, and whatever it didn't sync is made available when the file is closed later by the NN. For DN failures, read on.

Replication in 1.x/0.20.x is done via pipelines. It's done regardless of sync() calls. All write packets are indeed sent to and acknowledged by each DN in the constructed pipeline as the write progresses. For a good diagram of the sequence here, see Figure 3.3 | Page 66 | Chapter 3: The Hadoop Distributed Filesystem, in Tom's "Hadoop: The Definitive Guide" (2nd ed. page nos. Gotta get the 3rd ed. soon :)) The sync behavior is further explained under the 'Coherency Model' title at Page 68 | Chapter 3: The Hadoop Distributed Filesystem of the same book. Think of sync() more as a checkpoint done over the write pipeline, such that new readers can read the length of synced bytes immediately and that those bytes are guaranteed to be outside of the DN application (JVM) buffers (i.e. flushed).

Some further notes, for general info: In the 0.20.x/1.x releases, there's no hard guarantee that the write-buffer flushing done via sync ensures the data went to the *disk*. It may remain in the OS buffers (a feature in OSes, for performance). This is because we do not do an fsync() (i.e. calling force on the FileChannel for the block and metadata outputs), but rather just an output stream flush. In the future, via the 2.0.1-alpha release (soon to come at this point) and onwards, the specific call hsync() will ensure that this is not the case. However, if you are OK with the OS buffers feature/caveat and primarily need syncing not for reliability but for readers, you may use the call hflush() and save on performance. One place where hsync() is to be preferred over hflush() is where you use WALs (for data reliability), and HBase is one such application. With hsync(), HBase can survive potential failures caused by major power failure cases (among others). Let us know if this clears it up for you!

On Sat, Jun 9, 2012 at 4:58 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I am wondering about the role of sync in the replication of data to other nodes. Say a client writes a line to a file in Hadoop; at this point the file handle is open and sync has not been called. In this scenario, is the data also replicated, as defined by the replication factor, to other nodes? I am wondering: if a crash occurs at this point, do I have the data on other nodes?

-- Harsh J
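[Editor's note] A minimal sketch of the write-then-sync pattern described above (1.x API, where FSDataOutputStream.sync() plays the role later split into hflush()/hsync(); the path and payload are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(new Path("/logs/events.log")); // hypothetical path
    out.write("one record\n".getBytes());
    out.sync(); // checkpoint: new readers can now see these bytes
    out.write("another record\n".getBytes());
    // anything after the last sync becomes visible only on close()
    // (or when the NameNode lease-recovers the file after a client crash)
    out.close();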
Sync and Data Replication
I am wondering about the role of sync in the replication of data to other nodes. Say a client writes a line to a file in Hadoop; at this point the file handle is open and sync has not been called. In this scenario, is the data also replicated, as defined by the replication factor, to other nodes? I am wondering: if a crash occurs at this point, do I have the data on other nodes?
Ideal file size
We have a continuous flow of data into a sequence file. I am wondering what the ideal file size would be before the file gets rolled over. I know too many small files are not good, but could someone tell me what the ideal size would be such that it doesn't overload the NameNode?
Re: Ideal file size
On Wed, Jun 6, 2012 at 9:48 AM, M. C. Srivas mcsri...@gmail.com wrote: There are many factors to consider beyond just the size of the file. How long can you wait before you *have to* process the data? 5 minutes? 5 hours? 5 days? If you want good timeliness, you need to roll over faster. The longer you wait: 1. the lower the load on the NN; 2. but the poorer the timeliness; 3. and the larger the chance of lost data (i.e., the data is not saved until the file is closed and rolled over, unless you want to sync() after every write).

To begin with, I was going to use Flume and specify a rollover file size. I understand the above parameters; I just want to ensure that too many small files don't cause problems on the NameNode. For instance, there would be times when we get GBs of data in an hour and at times only a few hundred MB. From what Harsh, Edward and you have described, it doesn't cause issues with the NameNode, but rather an increase in processing times if there are too many small files. Looks like I need to find that balance. It would also be interesting to see how others solve this problem when not using Flume.

On Wed, Jun 6, 2012 at 7:00 AM, Mohit Anchlia mohitanch...@gmail.com wrote: We have a continuous flow of data into a sequence file. I am wondering what the ideal file size would be before the file gets rolled over. I know too many small files are not good, but could someone tell me what the ideal size would be such that it doesn't overload the NameNode?
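[Editor's note] As a rough sanity check on NameNode load - using the commonly cited rule of thumb (not a figure from this thread) that each file, block, and directory object costs on the order of 150 bytes of NameNode heap:

    1,000,000 small files x 1 block each
      ~ 2,000,000 namespace objects (file + block)
      x ~150 bytes/object  ~  ~300 MB of NameNode heap

The pressure on the NameNode grows with the object count, not the data volume, which is why fewer, larger files are preferred.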
Re: Writing click stream data to hadoop
On Fri, May 25, 2012 at 9:30 AM, Harsh J ha...@cloudera.com wrote: Mohit, Not if you call sync (or hflush/hsync in 2.0) periodically to persist your changes to the file. SequenceFile doesn't currently have a sync API built into it (in 1.0 at least), but you can call sync on the underlying output stream instead at the moment. This is possible to do in 1.0 (just own the output stream). Your use case also sounds like you may want to simply use Apache Flume (incubating) [http://incubator.apache.org/flume/], which already provides these features and the WAL-kind of reliability you seek.

Thanks Harsh. Does Flume also provide an API on top? I am getting this data as HTTP calls; how would I go about using Flume with HTTP calls?

On Fri, May 25, 2012 at 8:24 PM, Mohit Anchlia mohitanch...@gmail.com wrote: We get click data through API calls. I now need to send this data to our Hadoop environment. I am wondering if I could open one sequence file and write to it until it reaches a certain size. Once it's over the specified size, I can close that file and open a new one. Is this a good approach? The only thing I worry about is what happens if the server crashes before I am able to cleanly close the file. Would I lose all the previous data?

-- Harsh J
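[Editor's note] A hedged sketch of the "own the output stream" approach Harsh mentions, using the 1.0 createWriter overload that accepts an FSDataOutputStream (the path, key/value types, and codec are illustrative; verify the exact overload against your version):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.DefaultCodec;

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(new Path("/clicks/current.seq")); // hypothetical path
    SequenceFile.Writer writer = SequenceFile.createWriter(
        conf, out, Text.class, Text.class,
        SequenceFile.CompressionType.NONE, new DefaultCodec());
    writer.append(new Text("click-id"), new Text("payload"));
    out.sync(); // persist what has been appended so far (pre-hflush API)
    // ... keep appending until the rollover size, then:
    writer.close();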
Re: Bad connect ack with firstBadLink
Please see: http://hbase.apache.org/book.html#dfs.datanode.max.xcievers

On Fri, May 4, 2012 at 5:46 AM, madhu phatak phatak@gmail.com wrote: Hi, We are running a three-node cluster. For the last two days, whenever we copy a file to HDFS, it throws java.io.IOException: Bad connect ack with firstBadLink. I searched the net but was not able to resolve the issue. The following is the stack trace from the datanode log:

2012-05-04 18:08:08,868 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-7520371350112346377_50118 received exception java.net.SocketException: Connection reset
2012-05-04 18:08:08,869 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(172.23.208.17:50010, storageID=DS-1340171424-172.23.208.17-50010-1334672673051, infoPort=50075, ipcPort=50020):DataXceiver java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:168)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    at java.io.DataInputStream.read(DataInputStream.java:132)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:262)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:309)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:373)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:525)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
    at java.lang.Thread.run(Thread.java:662)

It would be great if someone could point in the direction of how to solve this problem.

-- https://github.com/zinnia-phatak-dev/Nectar
Compressing map only output
Is there a way to compress the output of map-only jobs - the map output that gets stored on HDFS as part-m-* files? In Pig I used the settings below. Would these work for plain MapReduce jobs as well?

set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;
Re: Compressing map only output
Thanks! When I tried to search for this property I couldn't find it. Is there a page that has a complete list of properties and their usage?

On Mon, Apr 30, 2012 at 5:44 PM, Prashant Kommireddi prash1...@gmail.com wrote: Yes. These are Hadoop properties - using set is just a way for Pig to set those properties in your job conf.

On Mon, Apr 30, 2012 at 5:25 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Is there a way to compress the output of map-only jobs - the map output that gets stored on HDFS as part-m-* files? In Pig I used the settings below. Would these work for plain MapReduce jobs as well?

set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;
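[Editor's note] For plain MapReduce (old API, 1.x property names), a sketch of the equivalent settings - either as raw properties or via the FileOutputFormat helpers:

    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;

    JobConf job = new JobConf(conf); // conf: your existing Configuration
    job.setBoolean("mapred.output.compress", true);
    job.set("mapred.output.compression.codec",
            "org.apache.hadoop.io.compress.SnappyCodec");
    // or, equivalently, through the typed helpers:
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
    // in a map-only job this compresses the part-m-* files directly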
Re: Compressing map only output
Thanks a lot for the link!

On Mon, Apr 30, 2012 at 8:22 PM, Harsh J ha...@cloudera.com wrote: Hey Mohit, Most of what you need to know for jobs is available at http://hadoop.apache.org/common/docs/current/mapred_tutorial.html A more complete, mostly unseparated list of config params is also available at: http://hadoop.apache.org/common/docs/current/mapred-default.html (core-default.html, hdfs-default.html)

On Tue, May 1, 2012 at 6:36 AM, Mohit Anchlia mohitanch...@gmail.com wrote: Thanks! When I tried to search for this property I couldn't find it. Is there a page that has a complete list of properties and their usage?

On Mon, Apr 30, 2012 at 5:44 PM, Prashant Kommireddi prash1...@gmail.com wrote: Yes. These are Hadoop properties - using set is just a way for Pig to set those properties in your job conf.

On Mon, Apr 30, 2012 at 5:25 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Is there a way to compress the output of map-only jobs - the map output that gets stored on HDFS as part-m-* files? In Pig I used the settings below. Would these work for plain MapReduce jobs as well?

set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;

-- Harsh J
Re: DFSClient error
Thanks for the quick response, appreciate it. It looks like this might be the issue. But I am still trying to understand what is causing so many threads in my situation. Is a thread created per block, or per file? Because if it's per file, then it should not be more than 15. My second question: I read around 5 .gz files in 5 separate processes. This is constant, and the sizes of those 5 files are roughly equivalent. So why does it fail only halfway through and not right at the beginning? I am reading around 400 files, and it always fails when I reach around the 180th file. What's the default value of xceivers? Does 4096 consume too much stack size? Thanks

On Sun, Apr 29, 2012 at 1:14 PM, Harsh J ha...@cloudera.com wrote: It sounds to me like you're running out of DN xceivers. Try the solution offered at http://hbase.apache.org/book.html#dfs.datanode.max.xcievers I.e., add:

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>

to your DNs' config/hdfs-site.xml and restart the DNs.

On Mon, Apr 30, 2012 at 1:35 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I even tried to lower the number of parallel jobs even further, but I still get these errors. Any suggestion on how to troubleshoot this issue would be very helpful. Should I run hadoop fsck? How do people troubleshoot such issues? Does it sound like a bug?

2012-04-27 14:37:42,921 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-04-27 14:37:42,931 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 125.18.62.199:50010 java.io.EOFException
2012-04-27 14:37:42,932 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_6343044536824463287_24619
2012-04-27 14:37:42,932 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Excluding datanode 125.18.62.199:50010
2012-04-27 14:37:42,935 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 125.18.62.204:50010 java.io.EOFException
2012-04-27 14:37:42,935 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_2837215798109471362_24620
2012-04-27 14:37:42,936 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Excluding datanode 125.18.62.204:50010
2012-04-27 14:37:42,937 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-04-27 14:37:42,939 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 125.18.62.198:50010 java.io.EOFException
2012-04-27 14:37:42,939 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_2223489090936415027_24620
2012-04-27 14:37:42,940 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Excluding datanode 125.18.62.198:50010
2012-04-27 14:37:42,943 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 125.18.62.197:50010 java.io.EOFException
2012-04-27 14:37:42,943 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_1265169201875643059_24620
2012-04-27 14:37:42,944 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Excluding datanode 125.18.62.197:50010
2012-04-27 14:37:42,945 [Thread-5] WARN org.apache.hadoop.hdfs.DFSClient - DataStreamer Exception: java.io.IOException: Unable to create new block.
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3446)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2100(DFSClient.java:2627)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2822)
2012-04-27 14:37:42,945 [Thread-5] WARN org.apache.hadoop.hdfs.DFSClient - Error Recovery for block blk_1265169201875643059_24620 bad datanode[0] nodes == null
2012-04-27 14:37:42,945 [Thread-5] WARN org.apache.hadoop.hdfs.DFSClient - Could not get block locations. Source file /tmp/hadoop-hadoop/mapred/staging/hadoop/.staging/job_201204261707_0411/job.jar - Aborting...
2012-04-27 14:37:42,945 [Thread-4] INFO org.apache.hadoop.mapred.JobClient - Cleaning up the staging area hdfs://dsdb1:54310/tmp/hadoop-hadoop/mapred/staging/hadoop/.staging/job_201204261707_0411
2012-04-27 14:37:42,945 [Thread-4] ERROR org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:hadoop (auth:SIMPLE) cause:java.io.EOFException
2012-04-27 14:37:42,996 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 125.18.62.200:50010 java.io.IOException: Bad connect ack with firstBadLink as 125.18.62.198:50010
2012-04-27 14:37:42,996 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_-7583284266913502018_24621
2012-04-27 14:37:42,997 [Thread-5] INFO
Re: DFSClient error
After all the jobs fail I can't run anything. Once I restart the cluster I am able to run other jobs with no problems; hadoop fs and other IO-intensive jobs run just fine.

On Fri, Apr 27, 2012 at 3:12 PM, John George john...@yahoo-inc.com wrote: Can you run a regular 'hadoop fs' (put, ls or get) command? If yes, how about a wordcount example? '<path>/hadoop jar <path>/hadoop-*examples*.jar wordcount input output'

-Original Message- From: Mohit Anchlia mohitanch...@gmail.com Reply-To: common-user@hadoop.apache.org Date: Fri, 27 Apr 2012 14:36:49 -0700 To: common-user@hadoop.apache.org Subject: Re: DFSClient error

I even tried to reduce the number of jobs, but it didn't help. This is what I see:

datanode logs:
Initializing secure datanode resources
Successfully obtained privileged resources (streaming port = ServerSocket[addr=/0.0.0.0,localport=50010]) (http listener port = sun.nio.ch.ServerSocketChannelImpl[/0.0.0.0:50075])
Starting regular datanode initialization
26/04/2012 17:06:51 9858 jsvc.exec error: Service exit with a return value of 143

userlogs:
2012-04-26 19:35:22,801 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library is available
2012-04-26 19:35:22,801 INFO org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library loaded
2012-04-26 19:35:22,808 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2012-04-26 19:35:22,903 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect to /125.18.62.197:50010, add to deadNodes and continue java.io.EOFException
    at java.io.DataInputStream.readShort(DataInputStream.java:298)
    at org.apache.hadoop.hdfs.DFSClient$RemoteBlockReader.newBlockReader(DFSClient.java:1664)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.getBlockReader(DFSClient.java:2383)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:2056)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2170)
    at java.io.DataInputStream.read(DataInputStream.java:132)
    at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:97)
    at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:87)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
    at java.io.InputStream.read(InputStream.java:85)
    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:205)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:169)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:114)
    at org.apache.pig.builtin.PigStorage.getNext(PigStorage.java:109)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:456)
    at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
2012-04-26 19:35:22,906 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect to /125.18.62.204:50010, add to deadNodes and continue java.io.EOFException

namenode logs:
2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.JobTracker: Job job_201204261140_0244 added successfully for user 'hadoop' to queue 'default'
2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.JobTracker: Initializing job_201204261140_0244
2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.AuditLogger: USER=hadoop IP=125.18.62.196 OPERATION=SUBMIT_JOB TARGET=job_201204261140_0244 RESULT=SUCCESS
2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.JobInProgress: Initializing job_201204261140_0244
2012-04-26 16:12:53,581 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 125.18.62.198:50010 java.io.IOException: Bad connect ack with firstBadLink as 125.18.62.197:50010
2012-04-26 16:12:53,581 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_2499580289951080275_22499
2012-04-26 16:12:53,582 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 125.18.62.197:50010
2012-04-26 16:12:53,594 INFO
Re: Design question
Any suggestions or pointers would be helpful. Are there any best practices?

On Mon, Apr 23, 2012 at 3:27 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I just wanted to check how people design their storage directories for data that is sent to the system continuously. For example: for a given functionality we get a continuous data feed written to sequence files, which are then converted to a more structured format using MapReduce and stored in tab-separated files. For such a continuous feed, what's the best way to organize the directories and their names? Should it be based just on timestamp, or is there something better that helps in organizing the data? The second part of the question: is it better to store output in sequence files so that we can take advantage of per-record compression? This seems to be required, since gzip/snappy compression of an entire file would launch only one map task. And the last question: when compressing a flat file, should it first be split into multiple files so that we get multiple mappers if we need to run another job on this file? LZO is another alternative, but it requires additional configuration; is it preferred? Any articles or suggestions would be very helpful.
DFSClient error
I had 20 mappers in parallel reading 20 gz files, each file around 30-40 MB of data, over 5 Hadoop nodes, and then writing to the analytics database. Almost midway through, it started to get this error:

2012-04-26 16:13:53,723 [Thread-8] INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 17.18.62.192:50010 java.io.IOException: Bad connect ack with firstBadLink as 17.18.62.191:50010

I am trying to look at the logs, but they don't say much. What could be the reason? We are on a pretty closed, reliable network and all machines are up.
Design question
I just wanted to check how people design their storage directories for data that is sent to the system continuously. For example: for a given functionality we get a continuous data feed written to sequence files, which are then converted to a more structured format using MapReduce and stored in tab-separated files. For such a continuous feed, what's the best way to organize the directories and their names? Should it be based just on timestamp, or is there something better that helps in organizing the data? The second part of the question: is it better to store output in sequence files so that we can take advantage of per-record compression? This seems to be required, since gzip/snappy compression of an entire file would launch only one map task. And the last question: when compressing a flat file, should it first be split into multiple files so that we get multiple mappers if we need to run another job on this file? LZO is another alternative, but it requires additional configuration; is it preferred? Any articles or suggestions would be very helpful.
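[Editor's note] One hypothetical layout for such a continuous feed - time-bucketed directories keep ingestion, downstream processing, and retention (deleting old days) simple:

    /data/clicks/raw/2012/04/23/part-000N.seq       (rolled sequence files, as ingested)
    /data/clicks/structured/2012/04/23/part-r-000N  (tab-separated MapReduce output)

Jobs can then select work by date range instead of scanning everything.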
Re: Get Current Block or Split ID, and using it, the Block Path
I think if you called getInputFormat on the JobConf and then called getSplits, you would at least get the locations. http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/InputSplit.html

On Sun, Apr 8, 2012 at 9:16 AM, Deepak Nettem deepaknet...@gmail.com wrote: Hi, Is it possible to get the 'id' of the currently executing split or block from within the mapper? Using this block id / split id, I want to be able to query the NameNode to get the names of the hosts having that block/split, and the actual path to the data. I need this for some analytics that I'm doing. Is there a client API that allows doing this? If not, what's the best way to do this? Best, Deepak Nettem
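[Editor's note] Inside an old-API mapper there are also per-split job properties for file-based splits - a sketch (property names as in 1.x; verify against your version):

    public void configure(JobConf job) {
      String file = job.get("map.input.file");           // file backing this split
      long start  = job.getLong("map.input.start", 0L);  // split's byte offset
      long length = job.getLong("map.input.length", 0L); // split's byte length
      // With these, block hosts can be resolved client-side via
      // FileSystem.getFileBlockLocations(status, start, length).
    }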
Re: Doubt from the book Definitive Guide
On Wed, Apr 4, 2012 at 10:02 PM, Prashant Kommireddi prash1...@gmail.com wrote: Hi Mohit, What would be the advantage? Reducers in most cases read data from all the mappers. In the case where mappers were to write to HDFS, a reducer would still be required to read data from other datanodes across the cluster.

The only advantage I was thinking of was that in some cases reducers might be able to take advantage of data locality and avoid multiple HTTP calls, no? The data is written anyway, so the last merged file could go to HDFS instead of the local disk. I am new to Hadoop, so I'm just asking questions to understand the rationale behind using the local disk for the final map output.

Prashant

On Apr 4, 2012, at 9:55 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

On Wed, Apr 4, 2012 at 8:42 PM, Harsh J ha...@cloudera.com wrote: Hi Mohit,

On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I am going through the chapter "How MapReduce works" and have some confusion: 1) The description of the mapper says that reducers get the output file using an HTTP call. But the description under "The Reduce Side" doesn't specifically say if it's copied using HTTP. So, first confusion: is the output copied from mapper -> reducer, or from reducer -> mapper? And second, is the call http:// or hdfs://?

The flow is as simple as this: 1. For an M+R job, a map completes its task after writing all partitions down to the tasktracker's local filesystem (under the mapred.local.dir directories). 2. Reducers fetch completion locations from events at the JobTracker, and query the TaskTracker there to provide the specific partition they need, which is done over the TaskTracker's HTTP service (50060). So, to clear things up: the map doesn't send it to the reduce, nor does the reduce ask the actual map task. It is the tasktracker itself that makes the bridge here. Note, however, that in Hadoop 2.0 the transfer via the ShuffleHandler would be over Netty connections. This would be much faster and more reliable.

2) My understanding was that mapper output gets written to HDFS, since I've seen part-m-00000 files in HDFS. If mapper output is written to HDFS, then shouldn't reducers simply read it from HDFS instead of making HTTP calls to the tasktrackers' location?

A map-only job usually writes out to HDFS directly (no sorting done, because no reducer is involved). If the job is a map+reduce one, the default output is collected to the local filesystem for partitioning and sorting at the map end, and eventually grouping at the reduce end. Basically: data you want to send to the reducer from the mapper goes to the local FS, where multiple actions are to be performed on it; other data may go directly to HDFS. Reducers are currently scheduled pretty randomly, but yes, their scheduling can be improved for certain scenarios. However, if you are suggesting that map partitions ought to be written to HDFS itself (with replication or without), I don't see performance improving. Note that the partitions aren't merely written but need to be sorted as well (at either end). To do that would require the ability to spill frequently (because we don't have infinite memory to do it all in RAM), and doing such a thing on HDFS would only mean a slowdown.

Thanks for clearing my doubts. In this case I was merely suggesting that if the mapper output (the merged output at the end, or the shuffle output) is stored in HDFS, then reducers can just retrieve it from HDFS instead of asking the tasktracker for it. Once the reducer threads read it they can continue to work locally.

I hope this helps clear some things up for you.

-- Harsh J
Doubt from the book Definitive Guide
I am going through the chapter "How MapReduce works" and have some confusion:

1) The description of the mapper below says that reducers get the output file using an HTTP call. But the description under "The Reduce Side" doesn't specifically say if it's copied using HTTP. So, first confusion: is the output copied from mapper -> reducer, or from reducer -> mapper? And second, is the call http:// or hdfs://?

2) My understanding was that mapper output gets written to HDFS, since I've seen part-m-00000 files in HDFS. If mapper output is written to HDFS, then shouldn't reducers simply read it from HDFS instead of making HTTP calls to the tasktrackers' location?

- From the book ---

Mapper: The output file's partitions are made available to the reducers over HTTP. The number of worker threads used to serve the file partitions is controlled by the tasktracker.http.threads property; this setting is per tasktracker, not per map task slot. The default of 40 may need increasing for large clusters running large jobs.

6.4.2. The Reduce Side: Let's turn now to the reduce part of the process. The map output file is sitting on the local disk of the tasktracker that ran the map task (note that although map outputs always get written to the local disk of the map tasktracker, reduce outputs may not be), but now it is needed by the tasktracker that is about to run the reduce task for the partition. Furthermore, the reduce task needs the map output for its particular partition from several map tasks across the cluster. The map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each completes. This is known as the copy phase of the reduce task. The reduce task has a small number of copier threads so that it can fetch map outputs in parallel. The default is five threads, but this number can be changed by setting the mapred.reduce.parallel.copies property.
Re: Doubt from the book Definitive Guide
On Wed, Apr 4, 2012 at 8:42 PM, Harsh J ha...@cloudera.com wrote: Hi Mohit,

On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I am going through the chapter "How MapReduce works" and have some confusion: 1) The description of the mapper says that reducers get the output file using an HTTP call. But the description under "The Reduce Side" doesn't specifically say if it's copied using HTTP. So, first confusion: is the output copied from mapper -> reducer, or from reducer -> mapper? And second, is the call http:// or hdfs://?

The flow is as simple as this: 1. For an M+R job, a map completes its task after writing all partitions down to the tasktracker's local filesystem (under the mapred.local.dir directories). 2. Reducers fetch completion locations from events at the JobTracker, and query the TaskTracker there to provide the specific partition they need, which is done over the TaskTracker's HTTP service (50060). So, to clear things up: the map doesn't send it to the reduce, nor does the reduce ask the actual map task. It is the tasktracker itself that makes the bridge here. Note, however, that in Hadoop 2.0 the transfer via the ShuffleHandler would be over Netty connections. This would be much faster and more reliable.

2) My understanding was that mapper output gets written to HDFS, since I've seen part-m-00000 files in HDFS. If mapper output is written to HDFS, then shouldn't reducers simply read it from HDFS instead of making HTTP calls to the tasktrackers' location?

A map-only job usually writes out to HDFS directly (no sorting done, because no reducer is involved). If the job is a map+reduce one, the default output is collected to the local filesystem for partitioning and sorting at the map end, and eventually grouping at the reduce end. Basically: data you want to send to the reducer from the mapper goes to the local FS, where multiple actions are to be performed on it; other data may go directly to HDFS. Reducers are currently scheduled pretty randomly, but yes, their scheduling can be improved for certain scenarios. However, if you are suggesting that map partitions ought to be written to HDFS itself (with replication or without), I don't see performance improving. Note that the partitions aren't merely written but need to be sorted as well (at either end). To do that would require the ability to spill frequently (because we don't have infinite memory to do it all in RAM), and doing such a thing on HDFS would only mean a slowdown.

Thanks for clearing my doubts. In this case I was merely suggesting that if the mapper output (the merged output at the end, or the shuffle output) is stored in HDFS, then reducers can just retrieve it from HDFS instead of asking the tasktracker for it. Once the reducer threads read it they can continue to work locally.

I hope this helps clear some things up for you.

-- Harsh J
Re: setNumTasks
Could someone please help me answer this question?

On Wed, Mar 14, 2012 at 8:06 AM, Mohit Anchlia mohitanch...@gmail.com wrote: What is the corresponding system property for setNumTasks? Can it be used explicitly as a system property, like mapred.tasks.?
Re: setNumTasks
Sorry, I meant setNumMapTasks. What is mapred.map.tasks for? It's confusing as to what its purpose is. I tried setting it for my job, and I still see more map tasks running than mapred.map.tasks.

On Thu, Mar 22, 2012 at 7:53 AM, Harsh J ha...@cloudera.com wrote: There isn't such an API as setNumTasks. There is, however, setNumReduceTasks, which sets mapred.reduce.tasks. Does this answer your question?

On Thu, Mar 22, 2012 at 8:21 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Could someone please help me answer this question?

On Wed, Mar 14, 2012 at 8:06 AM, Mohit Anchlia mohitanch...@gmail.com wrote: What is the corresponding system property for setNumTasks? Can it be used explicitly as a system property, like mapred.tasks.?

-- Harsh J
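[Editor's note] The asymmetry, in short (old API; a sketch of the intended usage):

    JobConf job = new JobConf(conf);
    job.setNumReduceTasks(2);  // sets mapred.reduce.tasks - honored exactly
    job.setNumMapTasks(10);    // sets mapred.map.tasks - only a hint; the
                               // InputFormat's split count decides the real
                               // number of map tasks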
Re: SequenceFile split question
Thanks! That helps. I am reading small XML files from an external file system and then writing them to the SequenceFile. I made it a standalone client, thinking that MapReduce may not be the best way to do this type of writing. My understanding was that MapReduce is best suited for processing data within HDFS. Is MapReduce also one of the options I should consider?

On Thu, Mar 15, 2012 at 2:15 AM, Bejoy Ks bejoy.had...@gmail.com wrote: Hi Mohit, If you are using a standalone client application to do this, there is definitely just one instance of it running, and you'd be writing the sequence file to one HDFS block at a time. Once it reaches the HDFS block size, the writing continues to the next block; in the meantime the first block is replicated. If you are doing the same job distributed as MapReduce, you'd be writing to n files at a time, where n is the number of tasks in your MapReduce job. AFAIK the datanode where the blocks are to be placed is determined by Hadoop; it is not controlled by the end-user application. But if you are triggering the standalone job on a particular datanode, and if it has space, one replica would be stored on that same node. The same applies in the case of MR tasks as well. Regards, Bejoy.K.S

On Thu, Mar 15, 2012 at 6:17 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I have a client program that creates a sequence file, which essentially merges small files into a big file. I was wondering how the sequence file data is split across nodes. When I start, the sequence file is empty. Does it get split when it reaches the dfs.block size? If so, does that mean that I am always writing to just one node at a given point in time? If I start a new client writing a new sequence file, is there a way to select a different datanode?
Re: EOFException
This is actually just a Hadoop job over HDFS. I am assuming you also know why this is erroring out?

On Thu, Mar 15, 2012 at 1:02 PM, Gopal absoft...@gmail.com wrote: On 03/15/2012 03:06 PM, Mohit Anchlia wrote: When I start a job to read data from HDFS I start getting these errors. Does anyone know what this means and how to resolve it?

2012-03-15 10:41:31,402 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 164.28.62.204:50010 java.io.EOFException
2012-03-15 10:41:31,402 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_-6402969611996946639_11837
2012-03-15 10:41:31,403 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Excluding datanode 164.28.62.204:50010
2012-03-15 10:41:31,406 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 164.28.62.198:50010 java.io.EOFException
2012-03-15 10:41:31,406 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_-5442664108986165368_11838
2012-03-15 10:41:31,407 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 164.28.62.197:50010 java.io.EOFException
2012-03-15 10:41:31,407 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_-3373089616877234160_11838
2012-03-15 10:41:31,407 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Excluding datanode 164.28.62.198:50010
2012-03-15 10:41:31,409 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Excluding datanode 164.28.62.197:50010
2012-03-15 10:41:31,410 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 164.28.62.204:50010 java.io.EOFException
2012-03-15 10:41:31,410 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_4481292025401332278_11838
2012-03-15 10:41:31,411 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Excluding datanode 164.28.62.204:50010
2012-03-15 10:41:31,412 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 164.28.62.200:50010 java.io.EOFException
2012-03-15 10:41:31,412 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_-5326771177080888701_11838
2012-03-15 10:41:31,413 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Excluding datanode 164.28.62.200:50010
2012-03-15 10:41:31,414 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 164.28.62.197:50010 java.io.EOFException
2012-03-15 10:41:31,414 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_-8073750683705518772_11839
2012-03-15 10:41:31,415 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Excluding datanode 164.28.62.197:50010
2012-03-15 10:41:31,416 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 164.28.62.199:50010 java.io.EOFException
2012-03-15 10:41:31,416 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 164.28.62.198:50010 java.io.EOFException
2012-03-15 10:41:31,416 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_441003866688859169_11838
2012-03-15 10:41:31,416 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_-466858474055876377_11839
2012-03-15 10:41:31,417 [Thread-5] INFO org.apache.hadoop.hdfs.DFSClient - Excluding datanode 164.28.62.198:50010
2012-03-15 10:41:31,417 [Thread-5] WARN org.apache.hadoop.hdfs.DFSClient -

Try shutting down and restarting hbase.
SequenceFile split question
I have a client program that creates a sequence file, which essentially merges small files into a big file. I was wondering how the sequence file data is split across nodes. When I start, the sequence file is empty. Does it get split when it reaches the dfs.block size? If so, does that mean that I am always writing to just one node at a given point in time? If I start a new client writing a new sequence file, is there a way to select a different datanode?
Re: mapred.tasktracker.map.tasks.maximum not working
Thanks. Looks like there are some parameters that I can use at the client level and others that need a cluster-wide setting. Is there a place where I can see all the config parameters, with a description of which can be changed at the client level vs. the cluster level?

On Fri, Mar 9, 2012 at 10:39 PM, bejoy.had...@gmail.com wrote: Adding on to Chen's response: this is a setting meant at the TaskTracker level (an environment setting based on parameters like your CPU cores, memory, etc.), and you need to override it in each tasktracker's mapred-site.xml and restart the TT daemon for the changes to take effect. Regards, Bejoy K S. From handheld, please excuse typos.

-Original Message- From: Chen He airb...@gmail.com Date: Fri, 9 Mar 2012 20:16:23 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: Re: mapred.tasktracker.map.tasks.maximum not working

Setting mapred.tasktracker.map.tasks.maximum in your job means nothing, because the Hadoop MapReduce platform only checks this parameter when it starts. It is a system configuration. You need to set it in your conf/mapred-site.xml file and restart your Hadoop MapReduce cluster.

On Fri, Mar 9, 2012 at 7:32 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I have mapred.tasktracker.map.tasks.maximum set to 2 in my job, and I have 5 nodes. I was expecting this to allow only 10 concurrent map tasks, but I have 30 mappers running. Does Hadoop ignore this setting when it is supplied from the job?
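[Editor's note] For reference, the cluster-side change Bejoy and Chen describe would look like this in each tasktracker's conf/mapred-site.xml (the value shown is just the one from this thread), followed by a TT restart:

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value>
    </property>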
mapred.map.tasks vs mapred.tasktracker.map.tasks.maximum
What's the difference between mapred.tasktracker.reduce.tasks.maximum and mapred.map.tasks ** I want my data to be split against only 10 mappers in the entire cluster. Can I do that using one of the above parameters?
Re: mapred.map.tasks vs mapred.tasktracker.map.tasks.maximum
What's the difference between setNumMapTasks and mapred.map.tasks? On Fri, Mar 9, 2012 at 5:00 PM, Chen He airb...@gmail.com wrote: Hi Mohit mapred.tasktracker.reduce(map).tasks.maximum means how many reduce(map) slot(s) you can have on each tasktracker. mapred.job.reduce(maps) means default number of reduce (map) tasks your job will has. To set the number of mappers in your application. You can write like this: *configuration.setNumMapTasks(the number you want);* Chen Actually, you can just use configuration.set() On Fri, Mar 9, 2012 at 6:42 PM, Mohit Anchlia mohitanch...@gmail.com wrote: What's the difference between mapred.tasktracker.reduce.tasks.maximum and mapred.map.tasks ** I want my data to be split against only 10 mappers in the entire cluster. Can I do that using one of the above parameters?
mapred.tasktracker.map.tasks.maximum not working
I have mapred.tasktracker.map.tasks.maximum set to 2 in my job and I have 5 nodes. I was expecting this to have only 10 concurrent jobs. But I have 30 mappers running. Does hadoop ignores this setting when supplied from the job?
Re: mapred.map.tasks vs mapred.tasktracker.map.tasks.maximum
Is this system parameter too? Or can I specify as mapred.map.tasks? I am using pig. On Fri, Mar 9, 2012 at 6:19 PM, Chen He airb...@gmail.com wrote: if you do not specify setNumMapTasks, by default, system will use the number you configured for mapred.map.tasks in the conf/mapred-site.xml file. On Fri, Mar 9, 2012 at 7:19 PM, Mohit Anchlia mohitanch...@gmail.com wrote: What's the difference between setNumMapTasks and mapred.map.tasks? On Fri, Mar 9, 2012 at 5:00 PM, Chen He airb...@gmail.com wrote: Hi Mohit mapred.tasktracker.reduce(map).tasks.maximum means how many reduce(map) slot(s) you can have on each tasktracker. mapred.job.reduce(maps) means default number of reduce (map) tasks your job will has. To set the number of mappers in your application. You can write like this: *configuration.setNumMapTasks(the number you want);* Chen Actually, you can just use configuration.set() On Fri, Mar 9, 2012 at 6:42 PM, Mohit Anchlia mohitanch...@gmail.com wrote: What's the difference between mapred.tasktracker.reduce.tasks.maximum and mapred.map.tasks ** I want my data to be split against only 10 mappers in the entire cluster. Can I do that using one of the above parameters?
Re: Profiling Hadoop Job
Can you check which user you are running this process as and compare it with the ownership on the directory? On Thu, Mar 8, 2012 at 3:13 PM, Leonardo Urbina lurb...@mit.edu wrote: Does anyone have any idea how to solve this problem? Regardless of whether I'm using plain HPROF or profiling through Starfish, I am getting the same error: Exception in thread main java.io.FileNotFoundException: attempt_201203071311_0004_m_ 00_0.profile (Permission denied) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.init(FileOutputStream.java:194) at java.io.FileOutputStream.init(FileOutputStream.java:84) at org.apache.hadoop.mapred.JobClient.downloadProfile(JobClient.java:1226) at org.apache.hadoop.mapred.JobClient.monitorAndPrintJob(JobClient.java:1302) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1251) at com.BitSight.hadoopAggregator.AggregatorDriver.run(AggregatorDriver.java:89) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at com.BitSight.hadoopAggregator.AggregatorDriver.main(AggregatorDriver.java:94) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) But I can't find what permissions to change to fix this issue. Any ideas? Thanks in advance, Best, -Leo On Wed, Mar 7, 2012 at 3:52 PM, Leonardo Urbina lurb...@mit.edu wrote: Thanks, -Leo On Wed, Mar 7, 2012 at 3:47 PM, Jie Li ji...@cs.duke.edu wrote: Hi Leo, Thanks for pointing out the outdated README file. Glad to tell you that we do support the old API in the latest version. See here: http://www.cs.duke.edu/starfish/previous.html Welcome to join our mailing list and your questions will reach more of our group members. Jie On Wed, Mar 7, 2012 at 3:37 PM, Leonardo Urbina lurb...@mit.edu wrote: Hi Jie, According to the Starfish README, the hadoop programs must be written using the new Hadoop API. This is not my case (I am using MultipleInputs among other non-new API supported features). Is there any way around this? Thanks, -Leo On Wed, Mar 7, 2012 at 3:19 PM, Jie Li ji...@cs.duke.edu wrote: Hi Leonardo, You might want to try Starfish which supports the memory profiling as well as cpu/disk/network profiling for the performance tuning. Jie -- Starfish is an intelligent performance tuning tool for Hadoop. Homepage: www.cs.duke.edu/starfish/ Mailing list: http://groups.google.com/group/hadoop-starfish On Wed, Mar 7, 2012 at 2:36 PM, Leonardo Urbina lurb...@mit.edu wrote: Hello everyone, I have a Hadoop job that I run on several GBs of data that I am trying to optimize in order to reduce the memory consumption as well as improve the speed. 
I am following the steps outlined in Tom White's Hadoop: The Definitive Guide for profiling using HPROF (p161), by setting the following properties in the JobConf: job.setProfileEnabled(true); job.setProfileParams(-agentlib:hprof=cpu=samples,heap=sites,depth=6, + force=n,thread=y,verbose=n,file=%s); job.setProfileTaskRange(true, 0-2); job.setProfileTaskRange(false, 0-2); I am trying to run this locally on a single pseudo-distributed install of hadoop (0.20.2) and it gives the following error: Exception in thread main java.io.FileNotFoundException: attempt_201203071311_0004_m_00_0.profile (Permission denied) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.init(FileOutputStream.java:194) at java.io.FileOutputStream.init(FileOutputStream.java:84) at org.apache.hadoop.mapred.JobClient.downloadProfile(JobClient.java:1226) at org.apache.hadoop.mapred.JobClient.monitorAndPrintJob(JobClient.java:1302) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1251) at com.BitSight.hadoopAggregator.AggregatorDriver.run(AggregatorDriver.java:89) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at com.BitSight.hadoopAggregator.AggregatorDriver.main(AggregatorDriver.java:94) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at
Re: Java Heap space error
I am still trying to see how to narrow this down. Is it possible to set heapdumponoutofmemoryerror option on these individual tasks? On Mon, Mar 5, 2012 at 5:49 PM, Mohit Anchlia mohitanch...@gmail.comwrote: Sorry for multiple emails. I did find: 2012-03-05 17:26:35,636 INFO org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call- Usage threshold init = 715849728(699072K) used = 575921696(562423K) committed = 715849728(699072K) max = 715849728(699072K) 2012-03-05 17:26:35,719 INFO org.apache.pig.impl.util.SpillableMemoryManager: Spilled an estimate of 7816154 bytes from 1 objects. init = 715849728(699072K) used = 575921696(562423K) committed = 715849728(699072K) max = 715849728(699072K) 2012-03-05 17:26:36,881 INFO org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call - Collection threshold init = 715849728(699072K) used = 358720384(350312K) committed = 715849728(699072K) max = 715849728(699072K) 2012-03-05 17:26:36,885 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1 2012-03-05 17:26:36,888 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space at java.nio.HeapCharBuffer.init(HeapCharBuffer.java:39) at java.nio.CharBuffer.allocate(CharBuffer.java:312) at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:760) at org.apache.hadoop.io.Text.decode(Text.java:350) at org.apache.hadoop.io.Text.decode(Text.java:327) at org.apache.hadoop.io.Text.toString(Text.java:254) at org.apache.pig.piggybank.storage.SequenceFileLoader.translateWritableToPigDataType(SequenceFileLoader.java:105) at org.apache.pig.piggybank.storage.SequenceFileLoader.getNext(SequenceFileLoader.java:139) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:456) at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.Child$4.run(Child.java:270) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157) at org.apache.hadoop.mapred.Child.main(Child.java:264) On Mon, Mar 5, 2012 at 5:46 PM, Mohit Anchlia mohitanch...@gmail.comwrote: All I see in the logs is: 2012-03-05 17:26:36,889 FATAL org.apache.hadoop.mapred.TaskTracker: Task: attempt_201203051722_0001_m_30_1 - Killed : Java heap space Looks like task tracker is killing the tasks. Not sure why. I increased heap from 512 to 1G and still it fails. On Mon, Mar 5, 2012 at 5:03 PM, Mohit Anchlia mohitanch...@gmail.comwrote: I currently have java.opts.mapred set to 512MB and I am getting heap space errors. How should I go about debugging heap space issues?
Re: AWS MapReduce
On Mon, Mar 5, 2012 at 7:40 AM, John Conwell j...@iamjohn.me wrote: AWS MapReduce (EMR) does not use S3 for its HDFS persistance. If it did your S3 billing would be massive :) EMR reads all input jar files and input data from S3, but it copies these files down to its local disk. It then does starts the MR process, doing all HDFS reads and writes to the local disks. At the end of the MR job, it copies the MR job output and all process logs to S3, and then tears down the VM instances. You can see this for yourself if you spin up a small EMR cluster, but turn off the configuration flag that kills the VMs at the end if the MR job. Then look at the hadoop configuration files to see how hadoop is configured. I really like EMR. Amazon has done a lot of work to optimize the hadoop configurations and VM instance AMIs to execute MR jobs fairly efficiently on a VM cluster. I had to do a lot of (expensive) trial and error work to figure out an optimal hadoop / VM configuration to run our MR jobs without crashing / timing out the jobs. The only reason we didnt standardize on EMR was that it strongly bound your code base / process to using EMR for hadoop processing, vs a flexible infrastructure that could use a local cluster or cluster on a different cloud provider. Thanks for your input. I am assuming HDFS is created on ephemerial disks and not EBS. Also, is it possible to share some of your findings? On Sun, Mar 4, 2012 at 8:51 AM, Mohit Anchlia mohitanch...@gmail.com wrote: As far as I see in the docs it looks like you could also use hdfs instead of s3. But what I am not sure is if these are local disks or EBS. On Sun, Mar 4, 2012 at 2:27 AM, Hannes Carl Meyer hannesc...@googlemail.com wrote: Hi, yes, its loaded from S3. Imho is Amazon AWS Map-Reduce pretty slow. The setup is done pretty fast and there are some configuration parameters you can bypass - for example blocksizes etc. - but in the end imho setting up ec2 instances by copying images is the better alternative. Kind Regards Hannes On Sun, Mar 4, 2012 at 2:31 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I think found answer to this question. However, it's still not clear if HDFS is on local disk or EBS volumes. Does anyone know? On Sat, Mar 3, 2012 at 3:54 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Just want to check how many are using AWS mapreduce and understand the pros and cons of Amazon's MapReduce machines? Is it true that these map reduce machines are really reading and writing from S3 instead of local disks? Has anyone found issues with Amazon MapReduce and how does it compare with using MapReduce on local attached disks compared to using S3. --- www.informera.de Hadoop Big Data Services -- Thanks, John C
Re: Java Heap space error
All I see in the logs is: 2012-03-05 17:26:36,889 FATAL org.apache.hadoop.mapred.TaskTracker: Task: attempt_201203051722_0001_m_30_1 - Killed : Java heap space Looks like task tracker is killing the tasks. Not sure why. I increased heap from 512 to 1G and still it fails. On Mon, Mar 5, 2012 at 5:03 PM, Mohit Anchlia mohitanch...@gmail.comwrote: I currently have java.opts.mapred set to 512MB and I am getting heap space errors. How should I go about debugging heap space issues?
Re: Java Heap space error
Sorry for multiple emails. I did find: 2012-03-05 17:26:35,636 INFO org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call- Usage threshold init = 715849728(699072K) used = 575921696(562423K) committed = 715849728(699072K) max = 715849728(699072K) 2012-03-05 17:26:35,719 INFO org.apache.pig.impl.util.SpillableMemoryManager: Spilled an estimate of 7816154 bytes from 1 objects. init = 715849728(699072K) used = 575921696(562423K) committed = 715849728(699072K) max = 715849728(699072K) 2012-03-05 17:26:36,881 INFO org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call - Collection threshold init = 715849728(699072K) used = 358720384(350312K) committed = 715849728(699072K) max = 715849728(699072K) 2012-03-05 17:26:36,885 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1 2012-03-05 17:26:36,888 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space at java.nio.HeapCharBuffer.init(HeapCharBuffer.java:39) at java.nio.CharBuffer.allocate(CharBuffer.java:312) at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:760) at org.apache.hadoop.io.Text.decode(Text.java:350) at org.apache.hadoop.io.Text.decode(Text.java:327) at org.apache.hadoop.io.Text.toString(Text.java:254) at org.apache.pig.piggybank.storage.SequenceFileLoader.translateWritableToPigDataType(SequenceFileLoader.java:105) at org.apache.pig.piggybank.storage.SequenceFileLoader.getNext(SequenceFileLoader.java:139) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:456) at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.Child$4.run(Child.java:270) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157) at org.apache.hadoop.mapred.Child.main(Child.java:264) On Mon, Mar 5, 2012 at 5:46 PM, Mohit Anchlia mohitanch...@gmail.comwrote: All I see in the logs is: 2012-03-05 17:26:36,889 FATAL org.apache.hadoop.mapred.TaskTracker: Task: attempt_201203051722_0001_m_30_1 - Killed : Java heap space Looks like task tracker is killing the tasks. Not sure why. I increased heap from 512 to 1G and still it fails. On Mon, Mar 5, 2012 at 5:03 PM, Mohit Anchlia mohitanch...@gmail.comwrote: I currently have java.opts.mapred set to 512MB and I am getting heap space errors. How should I go about debugging heap space issues?
Re: AWS MapReduce
As far as I see in the docs it looks like you could also use hdfs instead of s3. But what I am not sure is if these are local disks or EBS. On Sun, Mar 4, 2012 at 2:27 AM, Hannes Carl Meyer hannesc...@googlemail.com wrote: Hi, yes, its loaded from S3. Imho is Amazon AWS Map-Reduce pretty slow. The setup is done pretty fast and there are some configuration parameters you can bypass - for example blocksizes etc. - but in the end imho setting up ec2 instances by copying images is the better alternative. Kind Regards Hannes On Sun, Mar 4, 2012 at 2:31 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I think found answer to this question. However, it's still not clear if HDFS is on local disk or EBS volumes. Does anyone know? On Sat, Mar 3, 2012 at 3:54 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Just want to check how many are using AWS mapreduce and understand the pros and cons of Amazon's MapReduce machines? Is it true that these map reduce machines are really reading and writing from S3 instead of local disks? Has anyone found issues with Amazon MapReduce and how does it compare with using MapReduce on local attached disks compared to using S3. --- www.informera.de Hadoop Big Data Services
Re: AWS MapReduce
I think found answer to this question. However, it's still not clear if HDFS is on local disk or EBS volumes. Does anyone know? On Sat, Mar 3, 2012 at 3:54 PM, Mohit Anchlia mohitanch...@gmail.comwrote: Just want to check how many are using AWS mapreduce and understand the pros and cons of Amazon's MapReduce machines? Is it true that these map reduce machines are really reading and writing from S3 instead of local disks? Has anyone found issues with Amazon MapReduce and how does it compare with using MapReduce on local attached disks compared to using S3.
Re: Hadoop pain points?
+1 On Fri, Mar 2, 2012 at 4:09 PM, Harsh J ha...@cloudera.com wrote: Since you ask about anything in general, when I forayed into using Hadoop, my biggest pain was lack of documentation clarity and completeness over the MR and DFS user APIs (and other little points). It would be nice to have some work done to have one example or semi-example for every single Input/OutputFormat, Mapper/Reducer implementations, etc. added to the javadocs. I believe examples and snippets help out a ton (tons more than explaining just behavior) to new devs. On Fri, Mar 2, 2012 at 9:45 PM, Kunaal kunalbha...@alumni.cmu.edu wrote: I am doing a general poll on what are the most prevalent pain points that people run into with Hadoop? These could be performance related (memory usage, IO latencies), usage related or anything really. The goal is to look for what areas this platform could benefit the most in the near future. Any feedback is much appreciated. Thanks, Kunal. -- Harsh J
kill -QUIT
When I try kill -QUIT for a job it doesn't send the stacktrace to the log files. Does anyone know why or if I am doing something wrong? I find the job using ps -ef|grep attempt. I then go to logs/userLogs/jobid/attemptid/
Adding nodes
Is this the right procedure to add nodes? I took some from hadoop wiki FAQ: http://wiki.apache.org/hadoop/FAQ 1. Update conf/slave 2. on the slave nodes start datanode and tasktracker 3. hadoop balancer Do I also need to run dfsadmin -refreshnodes?
Re: Adding nodes
On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria j...@cloudera.com wrote: You only have to refresh nodes if you're making use of an allows file. Thanks does it mean that when tasktracker/datanode starts up it communicates with namenode using master file? Sent from my iPhone On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com wrote: Is this the right procedure to add nodes? I took some from hadoop wiki FAQ: http://wiki.apache.org/hadoop/FAQ 1. Update conf/slave 2. on the slave nodes start datanode and tasktracker 3. hadoop balancer Do I also need to run dfsadmin -refreshnodes?
Re: Adding nodes
On Thu, Mar 1, 2012 at 4:57 PM, Joey Echeverria j...@cloudera.com wrote: Not quite. Datanodes get the namenode host from fs.defalt.name in core-site.xml. Task trackers find the job tracker from the mapred.job.tracker setting in mapred-site.xml. I actually meant to ask how does namenode/jobtracker know there is a new node in the cluster. Is it initiated by namenode when slave file is edited? Or is it initiated by tasktracker when tasktracker is started? Sent from my iPhone On Mar 1, 2012, at 18:49, Mohit Anchlia mohitanch...@gmail.com wrote: On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria j...@cloudera.com wrote: You only have to refresh nodes if you're making use of an allows file. Thanks does it mean that when tasktracker/datanode starts up it communicates with namenode using master file? Sent from my iPhone On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com wrote: Is this the right procedure to add nodes? I took some from hadoop wiki FAQ: http://wiki.apache.org/hadoop/FAQ 1. Update conf/slave 2. on the slave nodes start datanode and tasktracker 3. hadoop balancer Do I also need to run dfsadmin -refreshnodes?
Re: Adding nodes
Thanks all for the answers!! On Thu, Mar 1, 2012 at 5:52 PM, Arpit Gupta ar...@hortonworks.com wrote: It is initiated by the slave. If you have defined files to state which slaves can talk to the namenode (using config dfs.hosts) and which hosts cannot (using property dfs.hosts.exclude) then you would need to edit these files and issue the refresh command. On Mar 1, 2012, at 5:35 PM, Mohit Anchlia wrote: On Thu, Mar 1, 2012 at 4:57 PM, Joey Echeverria j...@cloudera.com wrote: Not quite. Datanodes get the namenode host from fs.defalt.name in core-site.xml. Task trackers find the job tracker from the mapred.job.tracker setting in mapred-site.xml. I actually meant to ask how does namenode/jobtracker know there is a new node in the cluster. Is it initiated by namenode when slave file is edited? Or is it initiated by tasktracker when tasktracker is started? Sent from my iPhone On Mar 1, 2012, at 18:49, Mohit Anchlia mohitanch...@gmail.com wrote: On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria j...@cloudera.com wrote: You only have to refresh nodes if you're making use of an allows file. Thanks does it mean that when tasktracker/datanode starts up it communicates with namenode using master file? Sent from my iPhone On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com wrote: Is this the right procedure to add nodes? I took some from hadoop wiki FAQ: http://wiki.apache.org/hadoop/FAQ 1. Update conf/slave 2. on the slave nodes start datanode and tasktracker 3. hadoop balancer Do I also need to run dfsadmin -refreshnodes? -- Arpit Hortonworks, Inc. email: ar...@hortonworks.com http://www.hadoopsummit.org/ http://www.hadoopsummit.org/ http://www.hadoopsummit.org/
Re: 100x slower mapreduce compared to pig
I am going to try few things today. I have a JAXBContext object that marshals the xml, this is static instance but my guess at this point is that since this is in separate jar then the one where job runs and I used DistributeCache.addClassPath this context is being created on every call for some reason. I don't know why that would be. I am going to create this instance as static in the mapper class itself and see if that helps. I also add debugs. Will post the results after try it out. On Tue, Feb 28, 2012 at 4:18 PM, Prashant Kommireddi prash1...@gmail.comwrote: It would be great if we can take a look at what you are doing in the UDF vs the Mapper. 100x slow does not make sense for the same job/logic, its either the Mapper code or may be the cluster was busy at the time you scheduled MapReduce job? Thanks, Prashant On Tue, Feb 28, 2012 at 4:11 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am comparing runtime of similar logic. The entire logic is exactly same but surprisingly map reduce job that I submit is 100x slow. For pig I use udf and for hadoop I use mapper only and the logic same as pig. Even the splits on the admin page are same. Not sure why it's so slow. I am submitting job like: java -classpath .:analytics.jar:/hadoop-0.20.2-cdh3u3/lib/*:/root/.mohit/hadoop-0.20.2-cdh3u3/*:common.jar com.services.dp.analytics.hadoop.mapred.FormMLProcessor /examples/testfile40.seq,/examples/testfile41.seq,/examples/testfile42.seq,/examples/testfile43.seq,/examples/testfile44.seq,/examples/testfile45.seq,/examples/testfile46.seq,/examples/testfile47.seq,/examples/testfile48.seq,/examples/testfile49.seq /examples/output1/ How should I go about looking the root cause of why it's so slow? Any suggestions would be really appreciated. One of the things I noticed is that on the admin page of map task list I see status as hdfs://dsdb1:54310/examples/testfile40.seq:0+134217728 but for pig the status is blank.
Re: 100x slower mapreduce compared to pig
I think I've found the problem. There was one line of code that caused this issue :) that was output.collect(key, value); I had to add more logging to the code to get to it. For some reason kill -QUIT didn't send the stacktrace to the userLogs/job/attempt/syslog , I searched all the logs and couldn't find one. Does anyone know where stacktraces are generally sent? On Wed, Feb 29, 2012 at 1:08 PM, Mohit Anchlia mohitanch...@gmail.comwrote: I can't seem to find what's causing this slowness. Nothing in the logs. It's just painfuly slow. However, pig job is awesome in performance that has the same logic. Here is the mapper code and the pig code: *public* *static* *class* Map *extends* MapReduceBase *implements* MapperText, Text, Text, Text { *public* *void* map(Text key, Text value, OutputCollectorText, Text output, Reporter reporter) *throws* IOException { String line = value.toString(); //log.info(output key: + key + value + value + value + line); FormMLType f; *try* { f = FormMLUtils.*convertToRows*(line); FormMLStack fm = *new* FormMLStack(f,key.toString()); fm.parseFormML(); *for* (String *row* : fm.getFormattedRecords(*false*)){ output.collect(key, value); } } *catch* (JAXBException e) { *log*.error(Error processing record + key, e); } } } And here is the pig udf: *public* DataBag exec(Tuple input) *throws* IOException { *try* { DataBag output = mBagFactory.newDefaultBag(); Object o = input.get(1); *if* (!(o *instanceof* String)) { *throw* *new* IOException( Expected document input to be chararray, but got + o.getClass().getName()); } Object o1 = input.get(0); *if* (!(o1 *instanceof* String)) { *throw* *new* IOException( Expected input to be chararray, but got + o.getClass().getName()); } String document = (String)o; String filename = (String)o1; FormMLType f = FormMLUtils.*convertToRows*(document); FormMLStack fm = *new* FormMLStack(f,filename); fm.parseFormML(); *for* (String row : fm.getFormattedRecords(*false*)){ output.add( mTupleFactory.newTuple(row)); } *return* output; } *catch* (ExecException ee) { log.error(Failed to Process , ee); *throw* ee; } *catch* (JAXBException e) { // *TODO* Auto-generated catch block log.error(Invalid xml, e); *throw* *new* IllegalArgumentException(invalid xml + e.getCause().getMessage()); } } On Wed, Feb 29, 2012 at 9:27 AM, Mohit Anchlia mohitanch...@gmail.comwrote: I am going to try few things today. I have a JAXBContext object that marshals the xml, this is static instance but my guess at this point is that since this is in separate jar then the one where job runs and I used DistributeCache.addClassPath this context is being created on every call for some reason. I don't know why that would be. I am going to create this instance as static in the mapper class itself and see if that helps. I also add debugs. Will post the results after try it out. On Tue, Feb 28, 2012 at 4:18 PM, Prashant Kommireddi prash1...@gmail.com wrote: It would be great if we can take a look at what you are doing in the UDF vs the Mapper. 100x slow does not make sense for the same job/logic, its either the Mapper code or may be the cluster was busy at the time you scheduled MapReduce job? Thanks, Prashant On Tue, Feb 28, 2012 at 4:11 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am comparing runtime of similar logic. The entire logic is exactly same but surprisingly map reduce job that I submit is 100x slow. For pig I use udf and for hadoop I use mapper only and the logic same as pig. Even the splits on the admin page are same. Not sure why it's so slow. 
I am submitting job like: java -classpath .:analytics.jar:/hadoop-0.20.2-cdh3u3/lib/*:/root/.mohit/hadoop-0.20.2-cdh3u3/*:common.jar com.services.dp.analytics.hadoop.mapred.FormMLProcessor /examples/testfile40.seq,/examples/testfile41.seq,/examples/testfile42.seq,/examples/testfile43.seq,/examples/testfile44.seq,/examples/testfile45.seq,/examples/testfile46.seq,/examples/testfile47.seq,/examples/testfile48.seq,/examples/testfile49.seq /examples/output1/ How should I go about looking the root cause of why it's so slow? Any suggestions would be really appreciated. One of the things I noticed is that on the admin page of map task list I see status as hdfs://dsdb1:54310/examples/testfile40.seq:0+134217728 but for pig the status is blank.
Re: Invocation exception
Thanks for the example. I did look at the logs and also at the admin page and all I see is the exception that I posted initially. I am not sure why adding an extra jar to the classpath in DistributedCache causes that exception. I tried to look at Configuration code in hadoop.util package but it doesn't tell much. It looks like it's throwing on this line configureMethod.invoke(theObject, conf); in below code. *private* *static* *void* setJobConf(Object theObject, Configuration conf) { //If JobConf and JobConfigurable are in classpath, AND //theObject is of type JobConfigurable AND //conf is of type JobConf then //invoke configure on theObject *try* { Class? jobConfClass = conf.getClassByName(org.apache.hadoop.mapred.JobConf); Class? jobConfigurableClass = conf.getClassByName(org.apache.hadoop.mapred.JobConfigurable); *if* (jobConfClass.isAssignableFrom(conf.getClass()) jobConfigurableClass.isAssignableFrom(theObject.getClass())) { Method configureMethod = jobConfigurableClass.getMethod(configure, jobConfClass); configureMethod.invoke(theObject, conf); } } *catch* (ClassNotFoundException e) { //JobConf/JobConfigurable not in classpath. no need to configure } *catch* (Exception e) { *throw* *new* RuntimeException(Error in configuring object, e); } } On Tue, Feb 28, 2012 at 9:25 PM, Harsh J ha...@cloudera.com wrote: Mohit, If you visit the failed task attempt on the JT Web UI, you can see the complete, informative stack trace on it. It would point the exact line the trouble came up in and what the real error during the configure-phase of task initialization was. A simple attempts page goes like the following (replace job ID and task ID of course): http://host:50030/taskdetails.jsp?jobid=job_201202041249_3964tipid=task_201202041249_3964_m_00 Once there, find and open the All logs link to see stdout, stderr, and syslog of the specific failed task attempt. You'll have more info sifting through this to debug your issue. This is also explained in Tom's book under the title Debugging a Job (p154, Hadoop: The Definitive Guide, 2nd ed.). On Wed, Feb 29, 2012 at 1:40 AM, Mohit Anchlia mohitanch...@gmail.com wrote: It looks like adding this line causes invocation exception. I looked in hdfs and I see that file in that path DistributedCache.*addFileToClassPath*(*new* Path(/jars/common.jar), conf); I have similar code for another jar DistributedCache.*addFileToClassPath*(*new* Path(/jars/analytics.jar), conf); but this works just fine. On Tue, Feb 28, 2012 at 11:44 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I commented reducer and combiner both and still I see the same exception. Could it be because I have 2 jars being added? 
On Mon, Feb 27, 2012 at 8:23 PM, Subir S subir.sasiku...@gmail.com wrote: On Tue, Feb 28, 2012 at 4:30 AM, Mohit Anchlia mohitanch...@gmail.com wrote: For some reason I am getting invocation exception and I don't see any more details other than this exception: My job is configured as: JobConf conf = *new* JobConf(FormMLProcessor.*class*); conf.addResource(hdfs-site.xml); conf.addResource(core-site.xml); conf.addResource(mapred-site.xml); conf.set(mapred.reduce.tasks, 0); conf.setJobName(mlprocessor); DistributedCache.*addFileToClassPath*(*new* Path(/jars/analytics.jar), conf); DistributedCache.*addFileToClassPath*(*new* Path(/jars/common.jar), conf); conf.setOutputKeyClass(Text.*class*); conf.setOutputValueClass(Text.*class*); conf.setMapperClass(Map.*class*); conf.setCombinerClass(Reduce.*class*); conf.setReducerClass(IdentityReducer.*class*); Why would you set the Reducer when the number of reducers is set to zero. Not sure if this is the real cause. conf.setInputFormat(SequenceFileAsTextInputFormat.*class*); conf.setOutputFormat(TextOutputFormat.*class*); FileInputFormat.*setInputPaths*(conf, *new* Path(args[0])); FileOutputFormat.*setOutputPath*(conf, *new* Path(args[1])); JobClient.*runJob*(conf); - * java.lang.RuntimeException*: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(* ReflectionUtils.java:93*) at org.apache.hadoop.util.ReflectionUtils.setConf(*ReflectionUtils.java:64*) at org.apache.hadoop.util.ReflectionUtils.newInstance(* ReflectionUtils.java:117*) at org.apache.hadoop.mapred.MapTask.runOldMapper(*MapTask.java:387*) at org.apache.hadoop.mapred.MapTask.run(*MapTask.java:325*) at org.apache.hadoop.mapred.Child$4.run(*Child.java:270*) at java.security.AccessController.doPrivileged(*Native Method*) at javax.security.auth.Subject.doAs(*Subject.java:396*) at org.apache.hadoop.security.UserGroupInformation.doAs(* UserGroupInformation.java:1157*) at org.apache.hadoop.mapred.Child.main(*Child.java:264*) Caused
Re: Invocation exception
I commented reducer and combiner both and still I see the same exception. Could it be because I have 2 jars being added? On Mon, Feb 27, 2012 at 8:23 PM, Subir S subir.sasiku...@gmail.com wrote: On Tue, Feb 28, 2012 at 4:30 AM, Mohit Anchlia mohitanch...@gmail.com wrote: For some reason I am getting invocation exception and I don't see any more details other than this exception: My job is configured as: JobConf conf = *new* JobConf(FormMLProcessor.*class*); conf.addResource(hdfs-site.xml); conf.addResource(core-site.xml); conf.addResource(mapred-site.xml); conf.set(mapred.reduce.tasks, 0); conf.setJobName(mlprocessor); DistributedCache.*addFileToClassPath*(*new* Path(/jars/analytics.jar), conf); DistributedCache.*addFileToClassPath*(*new* Path(/jars/common.jar), conf); conf.setOutputKeyClass(Text.*class*); conf.setOutputValueClass(Text.*class*); conf.setMapperClass(Map.*class*); conf.setCombinerClass(Reduce.*class*); conf.setReducerClass(IdentityReducer.*class*); Why would you set the Reducer when the number of reducers is set to zero. Not sure if this is the real cause. conf.setInputFormat(SequenceFileAsTextInputFormat.*class*); conf.setOutputFormat(TextOutputFormat.*class*); FileInputFormat.*setInputPaths*(conf, *new* Path(args[0])); FileOutputFormat.*setOutputPath*(conf, *new* Path(args[1])); JobClient.*runJob*(conf); - * java.lang.RuntimeException*: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(* ReflectionUtils.java:93*) at org.apache.hadoop.util.ReflectionUtils.setConf(*ReflectionUtils.java:64*) at org.apache.hadoop.util.ReflectionUtils.newInstance(* ReflectionUtils.java:117*) at org.apache.hadoop.mapred.MapTask.runOldMapper(*MapTask.java:387*) at org.apache.hadoop.mapred.MapTask.run(*MapTask.java:325*) at org.apache.hadoop.mapred.Child$4.run(*Child.java:270*) at java.security.AccessController.doPrivileged(*Native Method*) at javax.security.auth.Subject.doAs(*Subject.java:396*) at org.apache.hadoop.security.UserGroupInformation.doAs(* UserGroupInformation.java:1157*) at org.apache.hadoop.mapred.Child.main(*Child.java:264*) Caused by: *java.lang.reflect.InvocationTargetException * at sun.reflect.NativeMethodAccessorImpl.invoke0(*Native Method*) at sun.reflect.NativeMethodAccessorImpl.invoke(* NativeMethodAccessorImpl.java:39*) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav
Re: Invocation exception
It looks like adding this line causes invocation exception. I looked in hdfs and I see that file in that path DistributedCache.*addFileToClassPath*(*new* Path(/jars/common.jar), conf); I have similar code for another jar DistributedCache.*addFileToClassPath*(*new* Path(/jars/analytics.jar), conf); but this works just fine. On Tue, Feb 28, 2012 at 11:44 AM, Mohit Anchlia mohitanch...@gmail.comwrote: I commented reducer and combiner both and still I see the same exception. Could it be because I have 2 jars being added? On Mon, Feb 27, 2012 at 8:23 PM, Subir S subir.sasiku...@gmail.comwrote: On Tue, Feb 28, 2012 at 4:30 AM, Mohit Anchlia mohitanch...@gmail.com wrote: For some reason I am getting invocation exception and I don't see any more details other than this exception: My job is configured as: JobConf conf = *new* JobConf(FormMLProcessor.*class*); conf.addResource(hdfs-site.xml); conf.addResource(core-site.xml); conf.addResource(mapred-site.xml); conf.set(mapred.reduce.tasks, 0); conf.setJobName(mlprocessor); DistributedCache.*addFileToClassPath*(*new* Path(/jars/analytics.jar), conf); DistributedCache.*addFileToClassPath*(*new* Path(/jars/common.jar), conf); conf.setOutputKeyClass(Text.*class*); conf.setOutputValueClass(Text.*class*); conf.setMapperClass(Map.*class*); conf.setCombinerClass(Reduce.*class*); conf.setReducerClass(IdentityReducer.*class*); Why would you set the Reducer when the number of reducers is set to zero. Not sure if this is the real cause. conf.setInputFormat(SequenceFileAsTextInputFormat.*class*); conf.setOutputFormat(TextOutputFormat.*class*); FileInputFormat.*setInputPaths*(conf, *new* Path(args[0])); FileOutputFormat.*setOutputPath*(conf, *new* Path(args[1])); JobClient.*runJob*(conf); - * java.lang.RuntimeException*: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(* ReflectionUtils.java:93*) at org.apache.hadoop.util.ReflectionUtils.setConf(*ReflectionUtils.java:64*) at org.apache.hadoop.util.ReflectionUtils.newInstance(* ReflectionUtils.java:117*) at org.apache.hadoop.mapred.MapTask.runOldMapper(*MapTask.java:387*) at org.apache.hadoop.mapred.MapTask.run(*MapTask.java:325*) at org.apache.hadoop.mapred.Child$4.run(*Child.java:270*) at java.security.AccessController.doPrivileged(*Native Method*) at javax.security.auth.Subject.doAs(*Subject.java:396*) at org.apache.hadoop.security.UserGroupInformation.doAs(* UserGroupInformation.java:1157*) at org.apache.hadoop.mapred.Child.main(*Child.java:264*) Caused by: *java.lang.reflect.InvocationTargetException * at sun.reflect.NativeMethodAccessorImpl.invoke0(*Native Method*) at sun.reflect.NativeMethodAccessorImpl.invoke(* NativeMethodAccessorImpl.java:39*) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav
100x slower mapreduce compared to pig
I am comparing runtime of similar logic. The entire logic is exactly same but surprisingly map reduce job that I submit is 100x slow. For pig I use udf and for hadoop I use mapper only and the logic same as pig. Even the splits on the admin page are same. Not sure why it's so slow. I am submitting job like: java -classpath .:analytics.jar:/hadoop-0.20.2-cdh3u3/lib/*:/root/.mohit/hadoop-0.20.2-cdh3u3/*:common.jar com.services.dp.analytics.hadoop.mapred.FormMLProcessor /examples/testfile40.seq,/examples/testfile41.seq,/examples/testfile42.seq,/examples/testfile43.seq,/examples/testfile44.seq,/examples/testfile45.seq,/examples/testfile46.seq,/examples/testfile47.seq,/examples/testfile48.seq,/examples/testfile49.seq /examples/output1/ How should I go about looking the root cause of why it's so slow? Any suggestions would be really appreciated. One of the things I noticed is that on the admin page of map task list I see status as hdfs://dsdb1:54310/examples/testfile40.seq:0+134217728 but for pig the status is blank.
Re: dfs.block.size
Can someone please suggest if parameters like dfs.block.size, mapred.tasktracker.map.tasks.maximum are only cluster wide settings or can these be set per client job configuration? On Sat, Feb 25, 2012 at 5:43 PM, Mohit Anchlia mohitanch...@gmail.comwrote: If I want to change the block size then can I use Configuration in mapreduce job and set it when writing to the sequence file or does it need to be cluster wide setting in .xml files? Also, is there a way to check the block of a given file?
Task Killed but no errors
I submitted a map reduce job that had 9 tasks killed out of 139. But I don't see any errors in the admin page. The entire job however has SUCCEDED. How can I track down the reason? Also, how do I determine if this is something to worry about?
Re: dfs.block.size
How do I verify the block size of a given file? Is there a command? On Mon, Feb 27, 2012 at 7:59 AM, Joey Echeverria j...@cloudera.com wrote: dfs.block.size can be set per job. mapred.tasktracker.map.tasks.maximum is per tasktracker. -Joey On Mon, Feb 27, 2012 at 10:19 AM, Mohit Anchlia mohitanch...@gmail.com wrote: Can someone please suggest if parameters like dfs.block.size, mapred.tasktracker.map.tasks.maximum are only cluster wide settings or can these be set per client job configuration? On Sat, Feb 25, 2012 at 5:43 PM, Mohit Anchlia mohitanch...@gmail.com wrote: If I want to change the block size then can I use Configuration in mapreduce job and set it when writing to the sequence file or does it need to be cluster wide setting in .xml files? Also, is there a way to check the block of a given file? -- Joseph Echeverria Cloudera, Inc. 443.305.9434
Handling bad records
What's the best way to write records to a different file? I am doing xml processing and during processing I might come accross invalid xml format. Current I have it under try catch block and writing to log4j. But I think it would be better to just write it to an output file that just contains errors.
Re: Invocation exception
Does it matter if reducer is set even if the no of reducers is 0? Is there a way to get more clear reason? On Mon, Feb 27, 2012 at 8:23 PM, Subir S subir.sasiku...@gmail.com wrote: On Tue, Feb 28, 2012 at 4:30 AM, Mohit Anchlia mohitanch...@gmail.com wrote: For some reason I am getting invocation exception and I don't see any more details other than this exception: My job is configured as: JobConf conf = *new* JobConf(FormMLProcessor.*class*); conf.addResource(hdfs-site.xml); conf.addResource(core-site.xml); conf.addResource(mapred-site.xml); conf.set(mapred.reduce.tasks, 0); conf.setJobName(mlprocessor); DistributedCache.*addFileToClassPath*(*new* Path(/jars/analytics.jar), conf); DistributedCache.*addFileToClassPath*(*new* Path(/jars/common.jar), conf); conf.setOutputKeyClass(Text.*class*); conf.setOutputValueClass(Text.*class*); conf.setMapperClass(Map.*class*); conf.setCombinerClass(Reduce.*class*); conf.setReducerClass(IdentityReducer.*class*); Why would you set the Reducer when the number of reducers is set to zero. Not sure if this is the real cause. conf.setInputFormat(SequenceFileAsTextInputFormat.*class*); conf.setOutputFormat(TextOutputFormat.*class*); FileInputFormat.*setInputPaths*(conf, *new* Path(args[0])); FileOutputFormat.*setOutputPath*(conf, *new* Path(args[1])); JobClient.*runJob*(conf); - * java.lang.RuntimeException*: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(* ReflectionUtils.java:93*) at org.apache.hadoop.util.ReflectionUtils.setConf(*ReflectionUtils.java:64*) at org.apache.hadoop.util.ReflectionUtils.newInstance(* ReflectionUtils.java:117*) at org.apache.hadoop.mapred.MapTask.runOldMapper(*MapTask.java:387*) at org.apache.hadoop.mapred.MapTask.run(*MapTask.java:325*) at org.apache.hadoop.mapred.Child$4.run(*Child.java:270*) at java.security.AccessController.doPrivileged(*Native Method*) at javax.security.auth.Subject.doAs(*Subject.java:396*) at org.apache.hadoop.security.UserGroupInformation.doAs(* UserGroupInformation.java:1157*) at org.apache.hadoop.mapred.Child.main(*Child.java:264*) Caused by: *java.lang.reflect.InvocationTargetException * at sun.reflect.NativeMethodAccessorImpl.invoke0(*Native Method*) at sun.reflect.NativeMethodAccessorImpl.invoke(* NativeMethodAccessorImpl.java:39*) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav
Re: Invocation exception
On Mon, Feb 27, 2012 at 8:58 PM, Prashant Kommireddi prash1...@gmail.comwrote: Tom White's Definitive Guide book is a great reference. Answers to most of your questions could be found there. I've been through that book but haven't come accross how to debug this exception. Can you point me to the topic in that book where I'll find this information? Sent from my iPhone On Feb 27, 2012, at 8:54 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Does it matter if reducer is set even if the no of reducers is 0? Is there a way to get more clear reason? On Mon, Feb 27, 2012 at 8:23 PM, Subir S subir.sasiku...@gmail.com wrote: On Tue, Feb 28, 2012 at 4:30 AM, Mohit Anchlia mohitanch...@gmail.com wrote: For some reason I am getting invocation exception and I don't see any more details other than this exception: My job is configured as: JobConf conf = *new* JobConf(FormMLProcessor.*class*); conf.addResource(hdfs-site.xml); conf.addResource(core-site.xml); conf.addResource(mapred-site.xml); conf.set(mapred.reduce.tasks, 0); conf.setJobName(mlprocessor); DistributedCache.*addFileToClassPath*(*new* Path(/jars/analytics.jar), conf); DistributedCache.*addFileToClassPath*(*new* Path(/jars/common.jar), conf); conf.setOutputKeyClass(Text.*class*); conf.setOutputValueClass(Text.*class*); conf.setMapperClass(Map.*class*); conf.setCombinerClass(Reduce.*class*); conf.setReducerClass(IdentityReducer.*class*); Why would you set the Reducer when the number of reducers is set to zero. Not sure if this is the real cause. conf.setInputFormat(SequenceFileAsTextInputFormat.*class*); conf.setOutputFormat(TextOutputFormat.*class*); FileInputFormat.*setInputPaths*(conf, *new* Path(args[0])); FileOutputFormat.*setOutputPath*(conf, *new* Path(args[1])); JobClient.*runJob*(conf); - * java.lang.RuntimeException*: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(* ReflectionUtils.java:93*) at org.apache.hadoop.util.ReflectionUtils.setConf(*ReflectionUtils.java:64*) at org.apache.hadoop.util.ReflectionUtils.newInstance(* ReflectionUtils.java:117*) at org.apache.hadoop.mapred.MapTask.runOldMapper(*MapTask.java:387*) at org.apache.hadoop.mapred.MapTask.run(*MapTask.java:325*) at org.apache.hadoop.mapred.Child$4.run(*Child.java:270*) at java.security.AccessController.doPrivileged(*Native Method*) at javax.security.auth.Subject.doAs(*Subject.java:396*) at org.apache.hadoop.security.UserGroupInformation.doAs(* UserGroupInformation.java:1157*) at org.apache.hadoop.mapred.Child.main(*Child.java:264*) Caused by: *java.lang.reflect.InvocationTargetException * at sun.reflect.NativeMethodAccessorImpl.invoke0(*Native Method*) at sun.reflect.NativeMethodAccessorImpl.invoke(* NativeMethodAccessorImpl.java:39*) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav
Re: Handling bad records
Thanks that's helpful. In that example what is A and B referring to? Is that the output file name? mos.getCollector(seq, A, reporter).collect(key, new Text(Bye)); mos.getCollector(seq, B, reporter).collect(key, new Text(Chau)); On Mon, Feb 27, 2012 at 9:53 PM, Harsh J ha...@cloudera.com wrote: Mohit, Use the MultipleOutputs API: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html to have a named output of bad records. There is an example of use detailed on the link. On Tue, Feb 28, 2012 at 3:48 AM, Mohit Anchlia mohitanch...@gmail.com wrote: What's the best way to write records to a different file? I am doing xml processing and during processing I might come accross invalid xml format. Current I have it under try catch block and writing to log4j. But I think it would be better to just write it to an output file that just contains errors. -- Harsh J
Re: LZO with sequenceFile
On Sun, Feb 26, 2012 at 9:09 AM, Harsh J ha...@cloudera.com wrote: If you want to just quickly package the hadoop-lzo items instead of building/managing-deployment on your own, you can reuse Todd Lipcon's script at https://github.com/toddlipcon/hadoop-lzo-packager - Creates both RPMs and DEBs. Thanks! Some questions I have is: 1. Would it work with sequence files? I am using SequenceFileAsTextInputStream 2. If I use SequenceFile.CompressionType.RECORD or BLOCK would it still split the files? 3. I am also using CDH's 20.2 version of hadoop. On Sun, Feb 26, 2012 at 9:55 PM, Ioan Eugen Stan stan.ieu...@gmail.com wrote: 2012/2/26 Mohit Anchlia mohitanch...@gmail.com: Thanks. Does it mean LZO is not installed by default? How can I install LZO? The LZO library is released under GPL and I believe it can't be included in most distributions of Hadoop because of this (can't mix GPL with non GPL stuff). It should be easily available though. On Sat, Feb 25, 2012 at 6:27 PM, Shi Yu sh...@uchicago.edu wrote: Yes, it is supported by Hadoop sequence file. It is splittable by default. If you have installed and specified LZO correctly, use these: org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputForma t.setCompressOutput(job,true); org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputForma t.setOutputCompressorClass(job,com.hadoop.compression.lzo.LzoC odec.class); org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputForma t.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK); job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.outpu t.SequenceFileOutputFormat.class); Shi -- Ioan Eugen Stan http://ieugen.blogspot.com/ -- Harsh J
dfs.block.size
If I want to change the block size then can I use Configuration in mapreduce job and set it when writing to the sequence file or does it need to be cluster wide setting in .xml files? Also, is there a way to check the block of a given file?
Re: LZO with sequenceFile
Thanks. Does it mean LZO is not installed by default? How can I install LZO? On Sat, Feb 25, 2012 at 6:27 PM, Shi Yu sh...@uchicago.edu wrote: Yes, it is supported by Hadoop sequence file. It is splittable by default. If you have installed and specified LZO correctly, use these: org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputForma t.setCompressOutput(job,true); org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputForma t.setOutputCompressorClass(job,com.hadoop.compression.lzo.LzoC odec.class); org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputForma t.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK); job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.outpu t.SequenceFileOutputFormat.class); Shi
MapReduce tunning
I am looking at some hadoop tuning parameters like io.sort.mb, mapred.child.javaopts etc. - My question was where to look at for current setting - Are these settings configured cluster wide or per job? - What's the best way to look at reasons of slow performance?
Re: Splitting files on new line using hadoop fs
On Wed, Feb 22, 2012 at 12:23 PM, bejoy.had...@gmail.com wrote: Hi Mohit AFAIK there is no default mechanism available for the same in hadoop. File is split into blocks just based on the configured block size during hdfs copy. While processing the file using Mapreduce the record reader takes care of the new lines even if a line spans across multiple blocks. Could you explain more on the use case that demands such a requirement while hdfs copy itself? I am using pig's XMLLoader in piggybank to read xml files concatenated in a text file. But pig script doesn't work when file is big that causes hadoop to split the files. Any suggestions on how I can make it work? Below is my simple script that I would like to enhance, only if it starts working. Please note this works for small files. register '/root/pig-0.8.1-cdh3u3/contrib/piggybank/java/piggybank.jar' raw = LOAD '/examples/testfile5.txt using org.apache.pig.piggybank.storage.XMLLoader('abc') as (document:chararray); dump raw; --Original Message-- From: Mohit Anchlia To: common-user@hadoop.apache.org ReplyTo: common-user@hadoop.apache.org Subject: Splitting files on new line using hadoop fs Sent: Feb 23, 2012 01:45 How can I copy large text files using hadoop fs such that split occurs based on blocks + new lines instead of blocks alone? Is there a way to do this? Regards Bejoy K S From handheld, Please excuse typos.
Re: Splitting files on new line using hadoop fs
Thanks I did post this question to that group. All xml document are separated by a new line so that shouldn't be the issue, I think. On Wed, Feb 22, 2012 at 12:44 PM, bejoy.had...@gmail.com wrote: ** Hi Mohit I'm not an expert in pig and it'd be better using the pig user group for pig specific queries. I'd try to help you with some basic trouble shooting of the same It sounds strange that pig's XML Loader can't load larger XML files that consists of multiple blocks. Or is it like, pig is not able to load the concatenated files that you are trying with? If that is the case then it could be because of some issues since you are just appending multiple xml file contents into a single file. Pig users can give you some workarounds how they are dealing with loading of small xml files that are stored efficiently. Regards Bejoy K S From handheld, Please excuse typos. -- *From: *Mohit Anchlia mohitanch...@gmail.com *Date: *Wed, 22 Feb 2012 12:29:26 -0800 *To: *common-user@hadoop.apache.org; bejoy.had...@gmail.com *Subject: *Re: Splitting files on new line using hadoop fs On Wed, Feb 22, 2012 at 12:23 PM, bejoy.had...@gmail.com wrote: Hi Mohit AFAIK there is no default mechanism available for the same in hadoop. File is split into blocks just based on the configured block size during hdfs copy. While processing the file using Mapreduce the record reader takes care of the new lines even if a line spans across multiple blocks. Could you explain more on the use case that demands such a requirement while hdfs copy itself? I am using pig's XMLLoader in piggybank to read xml files concatenated in a text file. But pig script doesn't work when file is big that causes hadoop to split the files. Any suggestions on how I can make it work? Below is my simple script that I would like to enhance, only if it starts working. Please note this works for small files. register '/root/pig-0.8.1-cdh3u3/contrib/piggybank/java/piggybank.jar' raw = LOAD '/examples/testfile5.txt using org.apache.pig.piggybank.storage.XMLLoader('abc') as (document:chararray); dump raw; --Original Message-- From: Mohit Anchlia To: common-user@hadoop.apache.org ReplyTo: common-user@hadoop.apache.org Subject: Splitting files on new line using hadoop fs Sent: Feb 23, 2012 01:45 How can I copy large text files using hadoop fs such that split occurs based on blocks + new lines instead of blocks alone? Is there a way to do this? Regards Bejoy K S From handheld, Please excuse typos.
Streaming job hanging
Streaming job just seems to be hanging 12/02/22 17:35:50 INFO streaming.StreamJob: map 0% reduce 0% - On the admin page I see that it created 551 input split. Could somone suggest a way to find out what might be causing it to hang? I increased io.sort.mb to 200 MB. I am using 5 data nodes with 12 CPU, 96G RAM.
Re: Writing small files to one big file in hdfs
On Tue, Feb 21, 2012 at 9:25 AM, Bejoy Ks bejoy.had...@gmail.com wrote: Mohit Rather than just appending the content into a normal text file or so, you can create a sequence file with the individual smaller file content as values. Thanks. I was planning to use pig's org.apache.pig.piggybank.storage.XMLLoader for processing. Would it work with sequence file? This text file that I was referring to would be in hdfs itself. Is it still different than using sequence file? Regards Bejoy.K.S On Tue, Feb 21, 2012 at 10:45 PM, Mohit Anchlia mohitanch...@gmail.com wrote: We have small xml files. Currently I am planning to append these small files to one file in hdfs so that I can take advantage of splits, larger blocks and sequential IO. What I am unsure is if it's ok to append one file at a time to this hdfs file Could someone suggest if this is ok? Would like to know how other do it.
Re: Writing small files to one big file in hdfs
I am trying to look for examples that demonstrates using sequence files including writing to it and then running mapred on it, but unable to find one. Could you please point me to some examples of sequence files? On Tue, Feb 21, 2012 at 10:25 AM, Bejoy Ks bejoy.had...@gmail.com wrote: Hi Mohit AFAIK XMLLoader in pig won't be suited for Sequence Files. Please post the same to Pig user group for some workaround over the same. SequenceFIle is a preferred option when we want to store small files in hdfs and needs to be processed by MapReduce as it stores data in key value format.Since SequenceFileInputFormat is available at your disposal you don't need any custom input formats for processing the same using map reduce. It is a cleaner and better approach compared to just appending small xml file contents into a big file. On Tue, Feb 21, 2012 at 11:00 PM, Mohit Anchlia mohitanch...@gmail.com wrote: On Tue, Feb 21, 2012 at 9:25 AM, Bejoy Ks bejoy.had...@gmail.com wrote: Mohit Rather than just appending the content into a normal text file or so, you can create a sequence file with the individual smaller file content as values. Thanks. I was planning to use pig's org.apache.pig.piggybank.storage.XMLLoader for processing. Would it work with sequence file? This text file that I was referring to would be in hdfs itself. Is it still different than using sequence file? Regards Bejoy.K.S On Tue, Feb 21, 2012 at 10:45 PM, Mohit Anchlia mohitanch...@gmail.com wrote: We have small xml files. Currently I am planning to append these small files to one file in hdfs so that I can take advantage of splits, larger blocks and sequential IO. What I am unsure is if it's ok to append one file at a time to this hdfs file Could someone suggest if this is ok? Would like to know how other do it.
Re: Writing small files to one big file in hdfs
Thanks. How does mapreduce work on a sequence file? Is there an example I can look at? On Tue, Feb 21, 2012 at 11:34 AM, Arko Provo Mukherjee arkoprovomukher...@gmail.com wrote: Hi, Let's say all the smaller files are in the same directory. Then you can do:
BufferedWriter output = new BufferedWriter(new OutputStreamWriter(fs.create(output_path, true))); // Output path
FileStatus[] output_files = fs.listStatus(new Path(input_path)); // Input directory
for (int i = 0; i < output_files.length; i++) {
    BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(output_files[i].getPath())));
    String data = reader.readLine();
    while (data != null) {
        output.write(data);
        data = reader.readLine(); // advance to the next line, otherwise this loops forever
    }
    reader.close();
}
output.close();
In case you have the files in multiple directories, call the code for each of them with different input paths. Hope this helps! Cheers Arko On Tue, Feb 21, 2012 at 1:27 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying to look for examples that demonstrate using sequence files, including writing to one and then running mapred on it, but I am unable to find any. Could you please point me to some examples of sequence files? On Tue, Feb 21, 2012 at 10:25 AM, Bejoy Ks bejoy.had...@gmail.com wrote: Hi Mohit AFAIK XMLLoader in pig won't be suited for Sequence Files. Please post the same to the Pig user group for some workaround over the same. SequenceFile is a preferred option when we want to store small files in hdfs that need to be processed by MapReduce, as it stores data in key-value format. Since SequenceFileInputFormat is available at your disposal you don't need any custom input formats for processing the same using map reduce. It is a cleaner and better approach compared to just appending small xml file contents into a big file. On Tue, Feb 21, 2012 at 11:00 PM, Mohit Anchlia mohitanch...@gmail.com wrote: On Tue, Feb 21, 2012 at 9:25 AM, Bejoy Ks bejoy.had...@gmail.com wrote: Mohit Rather than just appending the content into a normal text file or so, you can create a sequence file with the individual smaller file content as values. Thanks. I was planning to use pig's org.apache.pig.piggybank.storage.XMLLoader for processing. Would it work with a sequence file? This text file that I was referring to would be in hdfs itself. Is it still different than using a sequence file? Regards Bejoy.K.S On Tue, Feb 21, 2012 at 10:45 PM, Mohit Anchlia mohitanch...@gmail.com wrote: We have small xml files. Currently I am planning to append these small files to one file in hdfs so that I can take advantage of splits, larger blocks and sequential IO. What I am unsure of is whether it's ok to append one file at a time to this hdfs file. Could someone suggest if this is ok? Would like to know how others do it.
Re: Writing to SequenceFile fails
I am past this error. Looks like I needed to use the CDH libraries. I changed my maven repo. Now I am stuck at org.apache.hadoop.security.AccessControlException since I am not writing as the user that owns the file. Looking online for solutions. On Tue, Feb 21, 2012 at 12:48 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying to write to the sequence file and it seems to be failing. Not sure why. Is there something I need to do?
String uri = "hdfs://db1:54310/examples/testfile1.seq";
FileSystem fs = FileSystem.get(URI.create(uri), conf); // Fails on this line
Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
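For anyone hitting the same EOFException: it is typically a client/cluster version mismatch, so the fix is to compile against the same Hadoop build the cluster runs. A pom fragment along these lines (repository URL and version are from memory, not verified; match the exact build of your cluster):
<repository>
  <id>cloudera</id>
  <url>https://repository.cloudera.com/content/repositories/releases/</url>
</repository>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>0.20.2-cdh3u3</version>
</dependency>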
Re: Writing small files to one big file in hdfs
Need some more help. I wrote the sequence file using the below code, but now when I run the mapreduce job I get java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text even though I didn't use LongWritable when I originally wrote to the sequence file.
// Code to write to the sequence file. There is no LongWritable here
org.apache.hadoop.io.Text key = new org.apache.hadoop.io.Text();
BufferedReader buffer = new BufferedReader(new FileReader(filePath));
String line = null;
org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text();
try {
    writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass(), SequenceFile.CompressionType.RECORD);
    int i = 1;
    long timestamp = System.currentTimeMillis();
    while ((line = buffer.readLine()) != null) {
        key.set(String.valueOf(timestamp));
        value.set(line);
        writer.append(key, value);
        i++;
    }
On Tue, Feb 21, 2012 at 12:18 PM, Arko Provo Mukherjee arkoprovomukher...@gmail.com wrote: Hi, I think the following link will help: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html Cheers Arko On Tue, Feb 21, 2012 at 2:04 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Sorry, maybe it's something obvious, but I was wondering when map or reduce gets called what would be the class used for key and value? If I used org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text(); would the map be called with the Text class? public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { On Tue, Feb 21, 2012 at 11:59 AM, Arko Provo Mukherjee arkoprovomukher...@gmail.com wrote: Hi Mohit, I am not sure that I understand your question. But you can write into a file using:
BufferedWriter output = new BufferedWriter(new OutputStreamWriter(fs.create(my_path, true)));
output.write(data);
Then you can pass that file as the input to your MapReduce program:
FileInputFormat.addInputPath(jobconf, new Path(my_path));
From inside your Map/Reduce methods, I think you should NOT be tinkering with the input / output paths of that Map/Reduce job. Cheers Arko On Tue, Feb 21, 2012 at 1:38 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Thanks. How does mapreduce work on a sequence file? Is there an example I can look at? On Tue, Feb 21, 2012 at 11:34 AM, Arko Provo Mukherjee arkoprovomukher...@gmail.com wrote: Hi, Let's say all the smaller files are in the same directory. Then you can do:
BufferedWriter output = new BufferedWriter(new OutputStreamWriter(fs.create(output_path, true))); // Output path
FileStatus[] output_files = fs.listStatus(new Path(input_path)); // Input directory
for (int i = 0; i < output_files.length; i++) {
    BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(output_files[i].getPath())));
    String data = reader.readLine();
    while (data != null) {
        output.write(data);
        data = reader.readLine(); // advance to the next line, otherwise this loops forever
    }
    reader.close();
}
output.close();
In case you have the files in multiple directories, call the code for each of them with different input paths. Hope this helps! Cheers Arko On Tue, Feb 21, 2012 at 1:27 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying to look for examples that demonstrate using sequence files, including writing to one and then running mapred on it, but I am unable to find any. Could you please point me to some examples of sequence files? On Tue, Feb 21, 2012 at 10:25 AM, Bejoy Ks bejoy.had...@gmail.com wrote: Hi Mohit AFAIK XMLLoader in pig won't be suited for Sequence Files. Please post the same to the Pig user group for some workaround over the same. SequenceFile is a preferred option when we want to store small files in hdfs that need to be processed by MapReduce, as it stores data in key-value format. Since SequenceFileInputFormat is available at your disposal you don't need any custom input formats for processing the same using map reduce. It is a cleaner and better approach compared to just appending small xml file contents into a big file. On Tue, Feb 21, 2012 at 11:00 PM, Mohit Anchlia mohitanch...@gmail.com wrote: On Tue, Feb 21, 2012 at 9:25 AM, Bejoy Ks bejoy.had...@gmail.com wrote: Mohit Rather than just appending the content into a normal text file or so, you can create a sequence file with the individual smaller file content as values.
Re: Writing small files to one big file in hdfs
It looks like in the mapper values are coming as binary instead of Text. Is this expected from a sequence file? I initially wrote the SequenceFile with Text values. On Tue, Feb 21, 2012 at 4:13 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Need some more help. I wrote the sequence file using the below code, but now when I run the mapreduce job I get java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text even though I didn't use LongWritable when I originally wrote to the sequence file.
// Code to write to the sequence file. There is no LongWritable here
org.apache.hadoop.io.Text key = new org.apache.hadoop.io.Text();
BufferedReader buffer = new BufferedReader(new FileReader(filePath));
String line = null;
org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text();
try {
    writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass(), SequenceFile.CompressionType.RECORD);
    int i = 1;
    long timestamp = System.currentTimeMillis();
    while ((line = buffer.readLine()) != null) {
        key.set(String.valueOf(timestamp));
        value.set(line);
        writer.append(key, value);
        i++;
    }
On Tue, Feb 21, 2012 at 12:18 PM, Arko Provo Mukherjee arkoprovomukher...@gmail.com wrote: Hi, I think the following link will help: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html Cheers Arko On Tue, Feb 21, 2012 at 2:04 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Sorry, maybe it's something obvious, but I was wondering when map or reduce gets called what would be the class used for key and value? If I used org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text(); would the map be called with the Text class? public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { On Tue, Feb 21, 2012 at 11:59 AM, Arko Provo Mukherjee arkoprovomukher...@gmail.com wrote: Hi Mohit, I am not sure that I understand your question. But you can write into a file using:
BufferedWriter output = new BufferedWriter(new OutputStreamWriter(fs.create(my_path, true)));
output.write(data);
Then you can pass that file as the input to your MapReduce program:
FileInputFormat.addInputPath(jobconf, new Path(my_path));
From inside your Map/Reduce methods, I think you should NOT be tinkering with the input / output paths of that Map/Reduce job. Cheers Arko On Tue, Feb 21, 2012 at 1:38 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Thanks. How does mapreduce work on a sequence file? Is there an example I can look at? On Tue, Feb 21, 2012 at 11:34 AM, Arko Provo Mukherjee arkoprovomukher...@gmail.com wrote: Hi, Let's say all the smaller files are in the same directory. Then you can do:
BufferedWriter output = new BufferedWriter(new OutputStreamWriter(fs.create(output_path, true))); // Output path
FileStatus[] output_files = fs.listStatus(new Path(input_path)); // Input directory
for (int i = 0; i < output_files.length; i++) {
    BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(output_files[i].getPath())));
    String data = reader.readLine();
    while (data != null) {
        output.write(data);
        data = reader.readLine(); // advance to the next line, otherwise this loops forever
    }
    reader.close();
}
output.close();
In case you have the files in multiple directories, call the code for each of them with different input paths. Hope this helps! Cheers Arko On Tue, Feb 21, 2012 at 1:27 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying to look for examples that demonstrate using sequence files, including writing to one and then running mapred on it, but I am unable to find any. Could you please point me to some examples of sequence files? On Tue, Feb 21, 2012 at 10:25 AM, Bejoy Ks bejoy.had...@gmail.com wrote: Hi Mohit AFAIK XMLLoader in pig won't be suited for Sequence Files. Please post the same to the Pig user group for some workaround over the same. SequenceFile is a preferred option when we want to store small files in hdfs that need to be processed by MapReduce, as it stores data in key-value format. Since SequenceFileInputFormat is available at your disposal you don't need any custom input formats for processing the same using map reduce. It is a cleaner and better approach compared to just appending small xml file contents into a big file.
Re: Writing small files to one big file in hdfs
Finally figured it out. I needed to use SequenceFileAsTextInputFormat. There is just a lack of examples that makes it difficult when you start. On Tue, Feb 21, 2012 at 4:50 PM, Mohit Anchlia mohitanch...@gmail.com wrote: It looks like in the mapper values are coming as binary instead of Text. Is this expected from a sequence file? I initially wrote the SequenceFile with Text values. On Tue, Feb 21, 2012 at 4:13 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Need some more help. I wrote the sequence file using the below code, but now when I run the mapreduce job I get java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text even though I didn't use LongWritable when I originally wrote to the sequence file.
// Code to write to the sequence file. There is no LongWritable here
org.apache.hadoop.io.Text key = new org.apache.hadoop.io.Text();
BufferedReader buffer = new BufferedReader(new FileReader(filePath));
String line = null;
org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text();
try {
    writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass(), SequenceFile.CompressionType.RECORD);
    int i = 1;
    long timestamp = System.currentTimeMillis();
    while ((line = buffer.readLine()) != null) {
        key.set(String.valueOf(timestamp));
        value.set(line);
        writer.append(key, value);
        i++;
    }
On Tue, Feb 21, 2012 at 12:18 PM, Arko Provo Mukherjee arkoprovomukher...@gmail.com wrote: Hi, I think the following link will help: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html Cheers Arko On Tue, Feb 21, 2012 at 2:04 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Sorry, maybe it's something obvious, but I was wondering when map or reduce gets called what would be the class used for key and value? If I used org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text(); would the map be called with the Text class? public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { On Tue, Feb 21, 2012 at 11:59 AM, Arko Provo Mukherjee arkoprovomukher...@gmail.com wrote: Hi Mohit, I am not sure that I understand your question. But you can write into a file using:
BufferedWriter output = new BufferedWriter(new OutputStreamWriter(fs.create(my_path, true)));
output.write(data);
Then you can pass that file as the input to your MapReduce program:
FileInputFormat.addInputPath(jobconf, new Path(my_path));
From inside your Map/Reduce methods, I think you should NOT be tinkering with the input / output paths of that Map/Reduce job. Cheers Arko On Tue, Feb 21, 2012 at 1:38 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Thanks. How does mapreduce work on a sequence file? Is there an example I can look at? On Tue, Feb 21, 2012 at 11:34 AM, Arko Provo Mukherjee arkoprovomukher...@gmail.com wrote: Hi, Let's say all the smaller files are in the same directory. Then you can do:
BufferedWriter output = new BufferedWriter(new OutputStreamWriter(fs.create(output_path, true))); // Output path
FileStatus[] output_files = fs.listStatus(new Path(input_path)); // Input directory
for (int i = 0; i < output_files.length; i++) {
    BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(output_files[i].getPath())));
    String data = reader.readLine();
    while (data != null) {
        output.write(data);
        data = reader.readLine(); // advance to the next line, otherwise this loops forever
    }
    reader.close();
}
output.close();
In case you have the files in multiple directories, call the code for each of them with different input paths. Hope this helps! Cheers Arko On Tue, Feb 21, 2012 at 1:27 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying to look for examples that demonstrate using sequence files, including writing to one and then running mapred on it, but I am unable to find any. Could you please point me to some examples of sequence files? On Tue, Feb 21, 2012 at 10:25 AM, Bejoy Ks bejoy.had...@gmail.com wrote: Hi Mohit AFAIK XMLLoader in pig won't be suited for Sequence Files. Please post the same to the Pig user group for some workaround over the same. SequenceFile is a preferred option when we want to store small files in hdfs that need to be processed by MapReduce, as it stores data in key-value format. Since SequenceFileInputFormat is available at your disposal you don't need any custom input formats for processing the same using map reduce.
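To spell that out for later readers, a minimal driver/mapper sketch (untested; depending on the Hadoop version, SequenceFileAsTextInputFormat may only exist under the old org.apache.hadoop.mapred package rather than org.apache.hadoop.mapreduce.lib.input):
// SequenceFileAsTextInputFormat converts both key and value to Text,
// which is why a Text-based mapper signature starts working.
job.setInputFormatClass(SequenceFileAsTextInputFormat.class);
job.setMapperClass(XmlMapper.class); // hypothetical mapper name

public static class XmlMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(key, value); // value holds one small xml document
    }
}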
Re: Processing small xml files
On Fri, Feb 17, 2012 at 11:37 PM, Srinivas Surasani vas...@gmail.com wrote: Hi Mohit, You can use Pig for processing XML files. PiggyBank has a built-in load function to load the XML files. Also you can specify pig.maxCombinedSplitSize and pig.splitCombination for efficient processing. I can't seem to find examples of how to do xml processing in Pig. Can you please send me some pointers? Basically I need to convert my xml to a more structured format using hadoop to write it to a database. On Sat, Feb 18, 2012 at 1:18 AM, Mohit Anchlia mohitanch...@gmail.com wrote: On Tue, Feb 14, 2012 at 10:56 AM, W.P. McNeill bill...@gmail.com wrote: I'm not sure what you mean by flat format here. In my scenario, I have a file input.xml that looks like this: <myfile> <section> <value>value1</value> </section> <section> <value>value2</value> </section> </myfile> input.xml is a plain text file. Not a sequence file. If I read it with the XMLInputFormat my mapper gets called with (key, value) pairs that look like this: (, <section><value>value1</value></section>) (, <section><value>value2</value></section>) Where the keys are numerical offsets into the file. I then use this information to write a sequence file with these (key, value) pairs. So my Hadoop job that uses XMLInputFormat takes a text file as input and produces a sequence file as output. I don't know a rule of thumb for how many small files is too many. Maybe someone else on the list can chime in. I just know that when your throughput gets slow that's one possible cause to investigate. I need to install hadoop. Does this XmlInputFormat come as part of the install? Can you please give me some pointers that would help me install hadoop and xmlinputformat if necessary? -- Srinivas srini...@cloudwick.com
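A sketch of the usual Pig pattern for this (tag name, paths and the regex are invented for illustration; REGEX_EXTRACT is a builtin in Pig 0.8):
register '/path/to/piggybank.jar';
raw = LOAD '/data/xml-input' USING org.apache.pig.piggybank.storage.XMLLoader('record') AS (document:chararray);
-- pull one field out of each <record>...</record> blob
vals = FOREACH raw GENERATE REGEX_EXTRACT(document, '<value>(.*)</value>', 1) AS value;
STORE vals INTO '/data/xml-out';
The structured output can then be bulk-loaded into the database with whatever loader that database provides.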
Re: Hadoop install
Thanks Do I have to do something special to get Mahout xmlinput format and Pig with the new release of hadoop? On Sat, Feb 18, 2012 at 6:42 AM, Tom Deutsch tdeut...@us.ibm.com wrote: Mohit - one place to start is here; http://hadoop.apache.org/common/releases.html#Download The release notes, as always, are well worth reading. Tom Deutsch Program Director Information Management Big Data Technologies IBM 3565 Harbor Blvd Costa Mesa, CA 92626-1420 tdeut...@us.ibm.com Mohit Anchlia mohitanch...@gmail.com 02/18/2012 06:24 AM Please respond to common-user@hadoop.apache.org To common-user@hadoop.apache.org cc Subject Hadoop install What's the best way or guide to install latest hadoop. Is the latest Hadoop still .20 which comes up in google search. Could someone guide me with the latest hadoop distribution. I also need pig and mahout xmlinputformat.
Re: Processing small xml files
On Tue, Feb 14, 2012 at 10:56 AM, W.P. McNeill bill...@gmail.com wrote: I'm not sure what you mean by flat format here. In my scenario, I have a file input.xml that looks like this: <myfile> <section> <value>value1</value> </section> <section> <value>value2</value> </section> </myfile> input.xml is a plain text file. Not a sequence file. If I read it with the XMLInputFormat my mapper gets called with (key, value) pairs that look like this: (, <section><value>value1</value></section>) (, <section><value>value2</value></section>) Where the keys are numerical offsets into the file. I then use this information to write a sequence file with these (key, value) pairs. So my Hadoop job that uses XMLInputFormat takes a text file as input and produces a sequence file as output. I don't know a rule of thumb for how many small files is too many. Maybe someone else on the list can chime in. I just know that when your throughput gets slow that's one possible cause to investigate. I need to install hadoop. Does this XmlInputFormat come as part of the install? Can you please give me some pointers that would help me install hadoop and xmlinputformat if necessary?
Re: Processing small xml files
On Sun, Feb 12, 2012 at 9:24 AM, W.P. McNeill bill...@gmail.com wrote: I've used the Mahout XMLInputFormat. It is the right tool if you have an XML file with one type of section repeated over and over again and want to turn that into a Sequence file where each repeated section is a value. I've found it helpful as a preprocessing step for converting raw XML input into something that can be handled by Hadoop jobs. Thanks for the input. Do you first convert it into a flat format and then run another hadoop job, or do you just read the xml sequence file and then perform reduce on that? Is there an advantage of first converting it into a flat file format? If you're worried about having lots of small files--specifically, about overwhelming your namenode because you have too many small files--the XMLInputFormat won't help with that. However, it may be possible to concatenate the small files into larger files, then have a Hadoop job that uses XMLInputFormat transform the large files into sequence files. How many are too many for the namenode? We have around 100M files, with about 100M more every year.
Developing MapReduce
I use eclipse. Is this http://wiki.apache.org/hadoop/EclipsePlugIn still the best way to develop mapreduce programs in hadoop? Just want to make sure before I go down this path. Or should I just add the hadoop jars to my eclipse classpath and create my own MapReduce programs? Thanks
Re: incremental loads into hadoop
This process of managing looks like more pain long term. Would it be easier to store in Hbase, which has a smaller block size? What's the avg. file size? On Sun, Oct 2, 2011 at 7:34 PM, Vitthal Suhas Gogate gog...@hortonworks.com wrote: Agree with Bejoy, although to minimize the processing latency you can still choose to write more frequently to HDFS, resulting in a larger number of smaller files on HDFS, rather than waiting to accumulate large amounts of data before writing to HDFS. As you may have a larger number of smaller files, it may be good to use the combine file input format so as to not have a large number of very small map tasks (one per file if less than block size). Now after you process the input data, you may not want to leave this large number of small files on HDFS, and hence you can use the Hadoop Archive (HAR) tool to combine and store them as a small number of bigger files. You can run this tool periodically in the background to archive the input that is already processed. The archive tool itself is implemented as an M/R job. Also, to get some level of atomicity, you may copy the data to HDFS at a temporary location before moving it to the final source partition (or directory). Existing data loading tools may be doing that already. --Suhas Gogate On Sun, Oct 2, 2011 at 11:12 AM, bejoy.had...@gmail.com wrote: Sam Your understanding is right, hadoop definitely works great with large volumes of data. But not necessarily every file should be in the range of giga, tera or peta bytes. Mostly when it is said that hadoop processes terabytes of data, it is the total data processed by a map reduce job (rather jobs; most use cases use more than one map reduce job for processing). It can be 10K files that make up the whole data. Why not a large number of small files? The overhead on the name node of housekeeping such a large amount of metadata (file-block information) would be huge, and there are definitely limits to it. But you can store smaller files together in splittable compressed formats. In general it is better to keep your file sizes at least the same as, or more than, your hdfs block size. By default it is 64 MB, but larger clusters have higher values, as multiples of 64. If your hdfs block size or your file sizes are less than the map reduce input split size then it is better to use InputFormats like CombineFileInputFormat or so for MR jobs. Usually the MR input split size is equal to your hdfs block size. In short, as a better practice, your single file size should be at least equal to one hdfs block size. As for the approach of keeping a file open for writing and then reading the same in parallel with a map reduce job, I fear that wouldn't work. AFAIK it won't. When a write is going on, some blocks or the file itself would be locked; not really sure whether it's the full file that is locked or not. In short some blocks wouldn't be available to the concurrent Map Reduce program during its processing. In your case a quick solution that comes to my mind is to keep writing your real time data into the flume queue/buffer. Set it to a desired size; once the queue gets full the data would be dumped into hdfs. Then as per your requirement you can kick off your jobs. If you are running MR jobs at very high frequency then make sure that for every run you have enough data to process, and choose your max number of mappers and reducers effectively and efficiently.
I'm not an expert on Flume; you may need to do more reading on the same before implementing. This is what I feel about your use case. But let's leave it open for the experts to comment. Hope it helps. Regards Bejoy K S -Original Message- From: Sam Seigal selek...@yahoo.com Sender: saurabh@gmail.com Date: Sat, 1 Oct 2011 15:50:46 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: Re: incremental loads into hadoop Hi Bejoy, Thanks for the response. While reading about Hadoop, I have come across threads where people claim that Hadoop is not a good fit for a large number of small files. It is good for files that are gigabytes/petabytes in size. If I am doing incremental loads, let's say every hour, do I need to wait until maybe the end of the day when enough data has been collected to start off a MapReduce job? I am wondering if an open file that is continuously being written to can at the same time be used as an input to an M/R job ... Also, let's say I did not want to do a load straight off the DB. The service, when committing a transaction to the OLTP system, sends a message for that transaction to a Hadoop Service that then writes the transaction into HDFS (the services are connected to each other via a persisted queue, hence are eventually consistent, but that is
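The HAR tool Suhas mentions is driven from the command line; roughly (paths and archive name invented):
$ hadoop archive -archiveName done-2011-10.har -p /user/sam incoming /user/sam/archived
$ hadoop fs -ls har:///user/sam/archived/done-2011-10.har
The archive is addressed through the har:// filesystem, so later jobs can read the packed files without unpacking them.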
Re: Binary content
On Thu, Sep 1, 2011 at 1:25 AM, Dieter Plaetinck dieter.plaeti...@intec.ugent.be wrote: On Wed, 31 Aug 2011 08:44:42 -0700 Mohit Anchlia mohitanch...@gmail.com wrote: Does map-reduce work well with binary contents in the file? This binary content is basically some CAD files, and the map reduce program needs to read these files using some proprietary tool, extract values and do some processing. Wondering if there are others doing a similar type of processing. Best practices etc. Yes, it works. You just need to select the right input format. Personally I store all my binary files into a sequencefile (because my binary files are small). Thanks! Is there a specific tutorial I can focus on to see how it could be done? Dieter
Binary content
Does map-reduce work well with binary contents in the file? This binary content is basically some CAD files, and the map reduce program needs to read these files using some proprietary tool, extract values and do some processing. Wondering if there are others doing a similar type of processing. Best practices etc.
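A minimal sketch of the sequencefile approach mentioned above, assuming each CAD file is small enough to hold in memory one at a time (paths invented):
// one record per binary file: filename -> raw bytes
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, new Path("/data/cad.seq"), Text.class, BytesWritable.class);
for (FileStatus stat : fs.listStatus(new Path("/data/cad-input"))) {
    FSDataInputStream in = fs.open(stat.getPath());
    byte[] buf = new byte[(int) stat.getLen()];
    IOUtils.readFully(in, buf, 0, buf.length);
    in.close();
    writer.append(new Text(stat.getPath().getName()), new BytesWritable(buf));
}
writer.close();
Mappers then read (Text, BytesWritable) records via SequenceFileInputFormat and can hand the raw bytes to the proprietary extraction tool.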
Re: Question about RAID controllers and hadoop
On Thu, Aug 11, 2011 at 3:26 PM, Charles Wimmer cwim...@yahoo-inc.com wrote: We currently use P410s in a 12 disk system. Each disk is set up as a RAID0 volume. Performance is at least as good as a bare disk. Can you please share what throughput you see with the P410s? Are these SATA or SAS? On 8/11/11 3:23 PM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com wrote: If I read that email chain correctly then they were referring to the classic JBOD vs multiple disks striped together conversation. The conversation that was started here is referring to JBOD vs 1 RAID 0 per disk and the effects of the raid controller on those independent raids. Matt -Original Message- From: Kai Voigt [mailto:k...@123.org] Sent: Thursday, August 11, 2011 5:17 PM To: common-user@hadoop.apache.org Subject: Re: Question about RAID controllers and hadoop Yahoo did some testing 2 years ago: http://markmail.org/message/xmzc45zi25htr7ry But an updated benchmark would be interesting to see. Kai Am 12.08.2011 um 00:13 schrieb GOEKE, MATTHEW (AG/1000): My assumption would be that having a set of 4 raid 0 disks would actually be better than having a controller that allowed pure JBOD of 4 disks, due to the cache on the controller. If anyone has any personal experience with this I would love to know the performance numbers, but our infrastructure guy is doing tests on exactly this over the next couple of days so I will pass it along once we have it. Matt -Original Message- From: Bharath Mundlapudi [mailto:bharathw...@yahoo.com] Sent: Thursday, August 11, 2011 5:00 PM To: common-user@hadoop.apache.org Subject: Re: Question about RAID controllers and hadoop True, you need a P410 controller. You can create RAID0 for each disk to make it act as JBOD. -Bharath From: Koert Kuipers ko...@tresata.com To: common-user@hadoop.apache.org Sent: Thursday, August 11, 2011 2:50 PM Subject: Question about RAID controllers and hadoop Hello all, We are considering using low end HP proliant machines (DL160s and DL180s) for cluster nodes. However with these machines, if you want to do more than 4 hard drives then HP puts in a P410 raid controller. We would configure the RAID controller to function as JBOD, by simply creating multiple RAID volumes with one disk each. Does anyone have experience with this setup? Is it a good idea, or am I introducing an i/o bottleneck? Thanks for your help! Best, Koert
-- Kai Voigt k...@123.org
Re: maprd vs mapreduce api
On Fri, Aug 5, 2011 at 3:42 PM, Stevens, Keith D. steven...@llnl.gov wrote: The Mapper and Reducer classes in org.apache.hadoop.mapreduce implement the identity function. So you should be able to just do conf.setMapperClass(org.apache.hadoop.mapreduce.Mapper.class); conf.setReducerClass(org.apache.hadoop.mapreduce.Reducer.class); without having to implement your own no-op classes. I recommend reading the javadoc for the differences between the old api and the new api; for example http://hadoop.apache.org/common/docs/r0.20.2/api/index.html indicates the different functionality of Mapper in the new api and its dual use as the identity mapper. Sorry for asking on this thread :) Does Definitive Guide 2 cover the new api? Cheers, --Keith On Aug 5, 2011, at 1:15 PM, garpinc wrote: I was following this tutorial on version 0.19.1: http://v-lad.org/Tutorials/Hadoop/23%20-%20create%20the%20project.html I however wanted to use the latest version of the api, 0.20.2. The original code in the tutorial had the following lines: conf.setMapperClass(org.apache.hadoop.mapred.lib.IdentityMapper.class); conf.setReducerClass(org.apache.hadoop.mapred.lib.IdentityReducer.class); Both Identity classes are deprecated. So it seemed the solution was to create a mapper and reducer as follows:
public static class NOOPMapper extends Mapper<Text, IntWritable, Text, IntWritable> {
    public void map(Text key, IntWritable value, Context context) throws IOException, InterruptedException {
        context.write(key, value);
    }
}
public static class NOOPReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        context.write(key, result);
    }
}
And then with this code:
Configuration conf = new Configuration();
Job job = new Job(conf, "testdriver");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path("In"));
FileOutputFormat.setOutputPath(job, new Path("Out"));
job.setMapperClass(NOOPMapper.class);
job.setReducerClass(NOOPReducer.class);
job.waitForCompletion(true);
However I get this message: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text at TestDriver$NOOPMapper.map(TestDriver.java:1) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177) 11/08/01 16:41:01 INFO mapred.JobClient: map 0% reduce 0% 11/08/01 16:41:01 INFO mapred.JobClient: Job complete: job_local_0001 11/08/01 16:41:01 INFO mapred.JobClient: Counters: 0 Can anyone tell me what I need for this to work? Attached is the full code: http://old.nabble.com/file/p32174859/TestDriver.java TestDriver.java -- View this message in context: http://old.nabble.com/maprd-vs-mapreduce-api-tp32174859p32174859.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
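For the archives: with TextInputFormat the framework hands the mapper a LongWritable byte offset and a Text line, so the generics above don't match what actually arrives at run time. A sketch of the fix (untested):
// input key must be LongWritable to match TextInputFormat
public static class NOOPMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    // no map() override needed; the inherited Mapper.map() is the identity
}
// and the declared output types must agree with what the mapper emits:
job.setOutputKeyClass(LongWritable.class);
job.setOutputValueClass(Text.class);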
Re: Hadoop cluster network requirement
Assuming everything is up, this solution still will not scale given the latency, tcp/ip buffers, sliding window etc. See BDP. Sent from my iPad On Aug 1, 2011, at 4:57 PM, Michael Segel michael_se...@hotmail.com wrote: Yeah what he said. It's never a good idea. Forget about losing a NN or a Rack, but just losing connectivity between data centers. (It happens more than you think.) Your entire cluster in both data centers goes down. Boom! It's a bad design. You're better off doing two different clusters. Is anyone really trying to sell this as a design? That's even more scary. Subject: Re: Hadoop cluster network requirement From: a...@apache.org Date: Sun, 31 Jul 2011 20:28:53 -0700 To: common-user@hadoop.apache.org; saq...@margallacomm.com On Jul 31, 2011, at 7:30 PM, Saqib Jang -- Margalla Communications wrote: Thanks, I'm independently doing some digging into Hadoop networking requirements and had a couple of quick follow-ups. Could I have some specific info on why different data centers cannot be supported for master node and data node comms? Also, what may be the benefits/use cases for such a scenario? Most people who try to put the NN and DNs in different data centers are trying to achieve disaster recovery: one file system in multiple locations. That isn't the way HDFS is designed and it will end in tears. There are multiple problems: 1) no guarantee that one block replica will be in each data center (thereby defeating the whole purpose!) 2) assuming one can work out problem 1, during a network break the NN will lose contact with one half of the DNs, causing a massive network replication storm 3) if one is using MR on top of this HDFS, the shuffle will likely kill the network in between (making MR performance pretty dreadful) and is going to cause delays for the DN heartbeats 4) I don't even want to think about rebalancing. ... and I'm sure a lot of other problems I'm forgetting at the moment. So don't do it. If you want disaster recovery, set up two completely separate HDFSes and run everything in parallel.
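For the two-separate-clusters setup, the usual way to keep them in sync is distcp, run as a periodic job (hostnames and paths invented):
$ hadoop distcp hdfs://nn-dc1:8020/data hdfs://nn-dc2:8020/backup/data
distcp is itself a MapReduce job, so the copy is parallelized across the cluster rather than funneled through one client.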
Re: Moving Files to Distributed Cache in MapReduce
Is this what you are looking for? http://hadoop.apache.org/common/docs/current/mapred_tutorial.html search for JobConf On Fri, Jul 29, 2011 at 1:51 PM, Roger Chen rogc...@ucdavis.edu wrote: Thanks for the response! However, I'm having an issue with this line Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf); because conf has private access in org.apache.hadoop.conf.Configured On Fri, Jul 29, 2011 at 11:18 AM, Mapred Learn mapred.le...@gmail.com wrote: I hope my previous reply helps... On Fri, Jul 29, 2011 at 11:11 AM, Roger Chen rogc...@ucdavis.edu wrote: After moving it to the distributed cache, how would I call it within my MapReduce program? On Fri, Jul 29, 2011 at 11:09 AM, Mapred Learn mapred.le...@gmail.com wrote: Did you try using the -files option in your hadoop jar command, as: /usr/bin/hadoop jar <jar name> <main class name> -files <absolute path of file to be added to distributed cache> <input dir> <output dir> On Fri, Jul 29, 2011 at 11:05 AM, Roger Chen rogc...@ucdavis.edu wrote: Slight modification: I now know how to add files to the distributed file cache, which can be done via this command placed in the main or run class: DistributedCache.addCacheFile(new URI("/user/hadoop/thefile.dat"), conf); However I am still having trouble locating the file in the distributed cache. How do I call the file path of thefile.dat in the distributed cache as a string? I am using Hadoop 0.20.2 On Fri, Jul 29, 2011 at 10:26 AM, Roger Chen rogc...@ucdavis.edu wrote: Hi all, Does anybody have examples of how one moves files from the local filestructure/HDFS to the distributed cache in MapReduce? A Google search turned up examples in Pig but not MR. -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center -- Roger Chen UC Davis Genome Center
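To close the loop on the private-conf problem: in the new API the Configuration is reached through the Context, typically in setup(). A sketch (the file name follows the thread; the rest is invented):
// driver side, before job submission
DistributedCache.addCacheFile(new URI("/user/hadoop/thefile.dat"), job.getConfiguration());

// mapper side
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    String localPath = cached[0].toString(); // local path of thefile.dat on this node
}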
Re: Replication and failure
On Thu, Jul 28, 2011 at 12:17 AM, Harsh J ha...@cloudera.com wrote: Mohit, I believe Tom's book (Hadoop: The Definitive Guide) covers this precisely well. Perhaps others too. Replication is a best-effort sort of thing. If 2 nodes are all that is available, then two replicas are written and one is left to the replica monitor service to replicate later as possible (leading to an underreplicated write for the moment). The scenario (with default configs) would only fail if you have 0 DataNodes 'available' to write to. Thanks Harsh. I think you answered my question. I thought that a replication of 3 is a must, and for that you really need at least 4 nodes so that if one of the nodes dies it can still write to 3 nodes. I am assuming writes to replica nodes are always synchronous and not eventually consistent. Or are you asking about what happens when a DN fails during a write operation? I am assuming there will be some errors in this case. On Thu, Jul 28, 2011 at 5:08 AM, Mohit Anchlia mohitanch...@gmail.com wrote: Just trying to understand what happens if there are 3 nodes with replication set to 3 and one node fails. Does it fail the writes too? If there is a link that I can look at, that would be great. I tried searching but didn't see any definitive answer. Thanks, Mohit -- Harsh J
Replication and failure
Just trying to understand what happens if there are 3 nodes with replication set to 3 and one node fails. Does it fail the writes too? If there is a link that I can look at, that would be great. I tried searching but didn't see any definitive answer. Thanks, Mohit
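For reference, the factor being discussed is just a client-side setting in hdfs-site.xml (the value shown is the default), and it can also be set per file at create time:
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>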
Re: No. of Map and reduce tasks
What if I had multiple files in the input directory? Hadoop should then fire parallel map tasks? On Thu, May 26, 2011 at 7:21 PM, jagaran das jagaran_...@yahoo.co.in wrote: If you give really low size files, then the benefit of Hadoop's big block size goes away. Instead try merging files. Hope that helps. From: James Seigel ja...@tynt.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org Sent: Thu, 26 May, 2011 6:04:07 PM Subject: Re: No. of Map and reduce tasks Set the input split size really low, you might get something. I'd rather you fire up some *nix commands and pack that file onto itself a bunch of times, then put it back into hdfs and let 'er rip. Sent from my mobile. Please excuse the typos. On 2011-05-26, at 4:56 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I think I understand that by the last 2 replies :) But my question is: can I change this configuration to, say, split the file into 250K chunks so that multiple mappers can be invoked? On Thu, May 26, 2011 at 3:41 PM, James Seigel ja...@tynt.com wrote: have more data for it to process :) On 2011-05-26, at 4:30 PM, Mohit Anchlia wrote: I ran a simple pig script on this file: -rw-r--r-- 1 root root 208348 May 26 13:43 excite-small.log that orders the contents by name. But it only created one mapper. How can I change this to distribute across multiple machines? On Thu, May 26, 2011 at 3:08 PM, jagaran das jagaran_...@yahoo.co.in wrote: Hi Mohit, No of Maps - It depends on what the Total File Size / Block Size is. No of Reducers - You can specify. Regards, Jagaran From: Mohit Anchlia mohitanch...@gmail.com To: common-user@hadoop.apache.org Sent: Thu, 26 May, 2011 2:48:20 PM Subject: No. of Map and reduce tasks How can I tell how the map and reduce tasks were spread across the cluster? I looked at the jobtracker web page but can't find that info. Also, can I specify how many map or reduce tasks I want to be launched? From what I understand it's based on the number of input files passed to hadoop. So if I have 4 files there will be 4 Map tasks that will be launched, and the reducer is dependent on the hashpartitioner.
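For the archives, the 0.20 FileInputFormat computes split sizes roughly as (paraphrased from memory, not the exact source):
splitSize = max(mapred.min.split.size, min(totalSize / mapred.map.tasks, blockSize))
with at least one split per file. So a single 208 KB file normally yields one map, and either packing more data together (as James suggests) or raising mapred.map.tasks is what drives the map count up; frameworks like Pig may apply their own split logic on top of this.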
Using own InputSplit
I am new to hadoop, and from what I understand, by default hadoop splits the input into blocks. Now this might result in splitting a line of a record into 2 pieces that get spread across 2 maps. For example, the line abcd might get split into ab and cd. How can one prevent this in hadoop and pig? I am looking for some examples where I can see how I can specify my own split so that it logically splits based on the record delimiter and not the block size. For some reason I am not able to find the right examples online.
Re: Using own InputSplit
thanks! Just thought it's better to post to multiple groups together since I didn't know where it belongs :) On Fri, May 27, 2011 at 10:04 AM, Harsh J ha...@cloudera.com wrote: Mohit, Please do not cross-post a question to multiple lists unless you're announcing something. What you describe does not happen; and the way the splitting is done for Text files is explained in good detail here: http://wiki.apache.org/hadoop/HadoopMapReduce Hope this solves your doubt :) On Fri, May 27, 2011 at 10:25 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am new to hadoop, and from what I understand, by default hadoop splits the input into blocks. Now this might result in splitting a line of a record into 2 pieces that get spread across 2 maps. For example, the line abcd might get split into ab and cd. How can one prevent this in hadoop and pig? I am looking for some examples where I can see how I can specify my own split so that it logically splits based on the record delimiter and not the block size. For some reason I am not able to find the right examples online. -- Harsh J
Re: Using own InputSplit
Actually this link confused me: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Job+Input Clearly, logical splits based on input-size is insufficient for many applications since record boundaries must be respected. In such cases, the application should implement a RecordReader, who is responsible for respecting record-boundaries and presents a record-oriented view of the logical InputSplit to the individual task. But it looks like the application doesn't need to do that since it's done by default? Or am I misinterpreting this entirely? On Fri, May 27, 2011 at 10:08 AM, Mohit Anchlia mohitanch...@gmail.com wrote: thanks! Just thought it's better to post to multiple groups together since I didn't know where it belongs :) On Fri, May 27, 2011 at 10:04 AM, Harsh J ha...@cloudera.com wrote: Mohit, Please do not cross-post a question to multiple lists unless you're announcing something. What you describe does not happen; and the way the splitting is done for Text files is explained in good detail here: http://wiki.apache.org/hadoop/HadoopMapReduce Hope this solves your doubt :) On Fri, May 27, 2011 at 10:25 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am new to hadoop, and from what I understand, by default hadoop splits the input into blocks. Now this might result in splitting a line of a record into 2 pieces that get spread across 2 maps. For example, the line abcd might get split into ab and cd. How can one prevent this in hadoop and pig? I am looking for some examples where I can see how I can specify my own split so that it logically splits based on the record delimiter and not the block size. For some reason I am not able to find the right examples online. -- Harsh J
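The way to square those two statements: the default reader for text already does this work; you only implement your own RecordReader for non-line-oriented records. The contract of the built-in LineRecordReader is roughly (a pseudo-summary, not the actual source):
// for a split covering bytes [start, end):
//   if start != 0, skip forward to the first '\n' -- that partial
//   line belongs to the reader of the previous split
//   keep returning whole lines, reading PAST 'end' if needed to
//   finish the last line that starts before the split boundary
So every line is read exactly once, by exactly one mapper, even when it straddles a block boundary.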
How to copy over using dfs
If I have to overwrite a file I generally use: hadoop dfs -rm <file> followed by hadoop dfs -copyFromLocal (or -put) <file>. Is there a command to overwrite/replace the file instead of doing the rm first?
Help with pigsetup
I sent this to pig apache user mailing list but have got no response. Not sure if that list is still active. thought I will post here if someone is able to help me. I am in process of installing and learning pig. I have a hadoop cluster and when I try to run pig in mapreduce mode it errors out: Hadoop version is hadoop-0.20.203.0 and pig version is pig-0.8.1 Error before Pig is launched ERROR 2999: Unexpected internal error. Failed to create DataStorage java.lang.RuntimeException: Failed to create DataStorage at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75) at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:58) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134) at org.apache.pig.impl.PigContext.connect(PigContext.java:183) at org.apache.pig.PigServer.init(PigServer.java:226) at org.apache.pig.PigServer.init(PigServer.java:215) at org.apache.pig.tools.grunt.Grunt.init(Grunt.java:55) at org.apache.pig.Main.run(Main.java:452) at org.apache.pig.Main.main(Main.java:107) Caused by: java.io.IOException: Call to dsdb1/172.18.60.96:54310 failed on local exception: java.io.EOFException at org.apache.hadoop.ipc.Client.wrapException(Client.java:775) at org.apache.hadoop.ipc.Client.call(Client.java:743) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy0.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359) at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:207) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:170) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95) at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72) ... 9 more Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
Re: Help with pigsetup
For some reason I don't see that reply from Jonathan in my Inbox. I'll try to google it. What should be my next step in that case? I can't use pig then? On Thu, May 26, 2011 at 10:00 AM, Harsh J ha...@cloudera.com wrote: I think Jonathan Coveney's reply on user@pig answered your question. Its basically an issue of hadoop version differences between the one Pig 0.8.1 release got bundled with vs. Hadoop 0.20.203 release which is newer. On Thu, May 26, 2011 at 10:26 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I sent this to pig apache user mailing list but have got no response. Not sure if that list is still active. thought I will post here if someone is able to help me. I am in process of installing and learning pig. I have a hadoop cluster and when I try to run pig in mapreduce mode it errors out: Hadoop version is hadoop-0.20.203.0 and pig version is pig-0.8.1 Error before Pig is launched ERROR 2999: Unexpected internal error. Failed to create DataStorage java.lang.RuntimeException: Failed to create DataStorage at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75) at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:58) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134) at org.apache.pig.impl.PigContext.connect(PigContext.java:183) at org.apache.pig.PigServer.init(PigServer.java:226) at org.apache.pig.PigServer.init(PigServer.java:215) at org.apache.pig.tools.grunt.Grunt.init(Grunt.java:55) at org.apache.pig.Main.run(Main.java:452) at org.apache.pig.Main.main(Main.java:107) Caused by: java.io.IOException: Call to dsdb1/172.18.60.96:54310 failed on local exception: java.io.EOFException at org.apache.hadoop.ipc.Client.wrapException(Client.java:775) at org.apache.hadoop.ipc.Client.call(Client.java:743) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy0.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359) at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:207) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:170) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95) at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72) ... 9 more Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446) -- Harsh J
Re: Help with pigsetup
On Thu, May 26, 2011 at 10:06 AM, Jonathan Coveney jcove...@gmail.com wrote: I'll repost it here then :) Here is what I had to do to get pig running with a different version of Hadoop (in my case, the cloudera build but I'd try this as well): build pig-withouthadoop.jar by running ant jar-withouthadoop. Then, when you run pig, put the pig-withouthadoop.jar on your classpath as well as your hadoop jar. In my case, I found that scripts only worked if I additionally manually registered the antlr jar: Thanks Jonathan! I will give it a shot. register /path/to/pig/build/ivy/lib/Pig/antlr-runtime-3.2.jar; Is this a windows command? Sorry, have not used this before. 2011/5/26 Mohit Anchlia mohitanch...@gmail.com For some reason I don't see that reply from Jonathan in my Inbox. I'll try to google it. What should be my next step in that case? I can't use pig then? On Thu, May 26, 2011 at 10:00 AM, Harsh J ha...@cloudera.com wrote: I think Jonathan Coveney's reply on user@pig answered your question. Its basically an issue of hadoop version differences between the one Pig 0.8.1 release got bundled with vs. Hadoop 0.20.203 release which is newer. On Thu, May 26, 2011 at 10:26 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I sent this to pig apache user mailing list but have got no response. Not sure if that list is still active. thought I will post here if someone is able to help me. I am in process of installing and learning pig. I have a hadoop cluster and when I try to run pig in mapreduce mode it errors out: Hadoop version is hadoop-0.20.203.0 and pig version is pig-0.8.1 Error before Pig is launched ERROR 2999: Unexpected internal error. Failed to create DataStorage java.lang.RuntimeException: Failed to create DataStorage at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75) at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:58) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134) at org.apache.pig.impl.PigContext.connect(PigContext.java:183) at org.apache.pig.PigServer.init(PigServer.java:226) at org.apache.pig.PigServer.init(PigServer.java:215) at org.apache.pig.tools.grunt.Grunt.init(Grunt.java:55) at org.apache.pig.Main.run(Main.java:452) at org.apache.pig.Main.main(Main.java:107) Caused by: java.io.IOException: Call to dsdb1/172.18.60.96:54310 failed on local exception: java.io.EOFException at org.apache.hadoop.ipc.Client.wrapException(Client.java:775) at org.apache.hadoop.ipc.Client.call(Client.java:743) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy0.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359) at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:207) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:170) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95) at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72) ... 
9 more Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446) -- Harsh J
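To spell out Jonathan's recipe for later readers (shell, not Windows; paths invented, jar name per the 0.20.203 release layout as I recall it):
$ cd /path/to/pig-0.8.1        # pig source tree
$ ant jar-withouthadoop
$ export HADOOP_HOME=/path/to/hadoop-0.20.203.0
$ export PIG_CLASSPATH=/path/to/pig-0.8.1/pig-withouthadoop.jar:$HADOOP_HOME/hadoop-core-0.20.203.0.jar:$HADOOP_HOME/conf
$ /path/to/pig-0.8.1/bin/pig
The register line, by contrast, is a Pig Latin statement: it goes at the top of the .pig script (or is typed into grunt), not into the shell.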