Job Tracker/Name Node redundancy
Are there any plans to build redundancy/failover support for the JobTracker and NameNode components in Hadoop? Take the current scenario: 1) A data/CPU-intensive job is submitted to a Hadoop cluster of 10 machines. 2) Halfway through the job execution, the JobTracker or NameNode fails. 3) We bring up a new JobTracker or NameNode manually. Will the individual TaskTrackers/DataNodes reconnect to the new masters, or will the job have to be resubmitted? If we had failover support, we could set up, say, three JobTracker masters and three NameNode masters so that if one died, another would gracefully take over and continue handling results from the worker nodes. Thanks! Ryan
EC2 Usage?
Hello all, Somewhat of an off-topic question, but I know there are Hadoop + EC2 users here. Does anyone know if there is a programmatic API to find out how many machine-time hours have been used by a Hadoop cluster (or anything else) running on EC2? I know that you can log into the EC2 web site and see this, but I'm wondering if there's a way to access this data programmatically via web services? Thanks, Ryan
Re: EC2 Usage?
Thanks! On Thu, Dec 18, 2008 at 12:17 PM, Tom White t...@cloudera.com wrote: Hi Ryan, The ec2-describe-instances command in the API tools reports the launch time for each instance, so you could work out the machine hours of your cluster using that information. Tom On Thu, Dec 18, 2008 at 4:59 PM, Ryan LeCompte lecom...@gmail.com wrote: Hello all, Somewhat of an off-topic question, but I know there are Hadoop + EC2 users here. Does anyone know if there is a programmatic API to find out how many machine-time hours have been used by a Hadoop cluster (or anything else) running on EC2? I know that you can log into the EC2 web site and see this, but I'm wondering if there's a way to access this data programmatically via web services? Thanks, Ryan
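For the archives, a rough shell sketch of Tom's suggestion. The awk field index for the launch timestamp is an assumption (it differs across versions of the EC2 API tools), and GNU date is assumed for timestamp parsing:

  # Approximate total machine hours across currently running instances.
  # ASSUMPTION: the launch timestamp is field 10 of each INSTANCE line;
  # verify against your ec2-describe-instances output first.
  now=$(date +%s)
  total=0
  for launch in $(ec2-describe-instances | awk '/^INSTANCE/ {print $10}'); do
    secs=$(( now - $(date -d "$launch" +%s) ))
    total=$(( total + (secs + 3599) / 3600 ))   # EC2 bills each started hour
  done
  echo "Approximate machine hours: $total"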
Re: Streaming data into Hadoop
Even better! I'll try this out tomorrow. Thanks, Ryan On Dec 9, 2008, at 10:36 PM, Aaron Kimball [EMAIL PROTECTED] wrote: Note also that cat foo | bin/hadoop fs -put - some/hdfs/path will use stdin. - Aaron On Mon, Dec 8, 2008 at 5:56 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: Just what I need -- thanks! On Mon, Dec 8, 2008 at 7:31 PM, Alex Loddengaard [EMAIL PROTECTED] wrote: This should answer your questions: http://wiki.apache.org/hadoop/MountableHDFS Alex On Mon, Dec 8, 2008 at 2:19 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: Hello all, I normally upload files into Hadoop via bin/hadoop fs -put file dest. However, is there a way to somehow stream data into Hadoop? For example, I'd love to do something like this: zcat xxx > HADOOP_HDFS_DESTINATION This would save me a ton of time since I wouldn't have to first unpack the .tgz file and upload the raw file into HDFS. Is this possible with Hadoop 0.19? Thanks, Ryan
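Putting the two together: note that zcat on a .tgz emits the decompressed tar stream, tar headers included, so tar's -xzO (extract to stdout) is the cleaner way to stream just the file contents. A sketch, with made-up paths:

  # Stream a tarball's contents straight into HDFS, no intermediate file.
  tar -xzOf logs.tgz | bin/hadoop fs -put - /user/ryan/logs.txt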
Streaming data into Hadoop
Hello all, I normally upload files into Hadoop via bin/hadoop fs -put file dest. However, is there a way to somehow stream data into Hadoop? For example, I'd love to do something like this: zcat xxx > HADOOP_HDFS_DESTINATION This would save me a ton of time since I wouldn't have to first unpack the .tgz file and upload the raw file into HDFS. Is this possible with Hadoop 0.19? Thanks, Ryan
Re: Streaming data into Hadoop
Just what I need -- thanks! On Mon, Dec 8, 2008 at 7:31 PM, Alex Loddengaard [EMAIL PROTECTED] wrote: This should answer your questions: http://wiki.apache.org/hadoop/MountableHDFS Alex On Mon, Dec 8, 2008 at 2:19 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: Hello all, I normally upload files into Hadoop via bin/hadoop fs -put file dest. However, is there a way to somehow stream data into Hadoop? For example, I'd love to do something like this: zcat xxx > HADOOP_HDFS_DESTINATION This would save me a ton of time since I wouldn't have to first unpack the .tgz file and upload the raw file into HDFS. Is this possible with Hadoop 0.19? Thanks, Ryan
Re: stack trace from hung task
For what it's worth, I started seeing these when I upgraded to 0.19. I was using 10 reduces, but changed it to 30 reduces for my job, and now I no longer see these errors. Thanks, Ryan On Fri, Dec 5, 2008 at 2:44 PM, Sriram Rao [EMAIL PROTECTED] wrote: Hi, When a task tracker kills a non-responsive task, it prints out a message like "Task X not reported status for 600 seconds. Killing!". The stack trace it then dumps out is that of the task tracker itself. Is there a way to get the hung task to dump out its stack trace before exiting? It would be nice if there was an easy way to send a kill -3 to the hung process and then kill it. Sriram
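In the meantime a manual version works, sketched below under the assumption that you can shell into the TaskTracker node and pick out the child JVM (the child's main class name varies a bit across versions, so adjust the grep):

  # Ask the hung task JVM for a thread dump, then kill it.
  # kill -QUIT (i.e. kill -3) makes the JVM print all thread stacks to its
  # stdout, which lands in the task's log under logs/userlogs/.
  PID=$(ps ax | grep 'org.apache.hadoop.mapred.*Child' | grep -v grep | awk '{print $1}')
  kill -QUIT "$PID"
  sleep 5            # give the JVM a moment to finish writing the dump
  kill -9 "$PID"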
Hadoop balancer
I've tried running the bin/hadoop balancer command since I recently added a new node to the Hadoop cluster. I noticed the following output in the beginning:

  08/12/03 10:26:35 INFO balancer.Balancer: Will move 10 GBbytes in this iteration
  Dec 3, 2008 10:26:35 AM 0 0 KB 2.67 GB 10 GB
  08/12/03 10:26:36 WARN balancer.Balancer: Error moving block -6653850537520285828 from 10.30.2.31:50010 to 10.30.2.23:50010 through 10.30.2.31:50010: block move is failed

However, subsequent iterations have succeeded. Is this harmless, or is there a possibility that data was corrupted during the re-balancing attempt? Thanks, Ryan
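(For later readers: the balancer also accepts a utilization threshold. The sketch below assumes the 0.18-era syntax, where the value is a percentage and the default is 10.)

  # Keep re-balancing until each DataNode's utilization is within 5%
  # of the cluster-wide average, then exit.
  bin/hadoop balancer -threshold 5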
Hadoop and .tgz files
Hello all, I'm using Hadoop 0.19 and just discovered that it has no problems processing .tgz files that contain text files. I was under the impression that it wouldn't be able to break a .tgz file up into multiple maps, but instead just treat it as 1 map per .tgz file. Was this a recent change or enhancement? I'm noticing that it is breaking up the .tgz file into multiple maps. Thanks, Ryan
Re: Hadoop and .tgz files
I believe I spoke a little too soon. Looks like Hadoop supports .gz files, not .tgz. :-) On Mon, Dec 1, 2008 at 10:46 AM, Ryan LeCompte [EMAIL PROTECTED] wrote: Hello all, I'm using Hadoop 0.19 and just discovered that it has no problems processing .tgz files that contain text files. I was under the impression that it wouldn't be able to break a .tgz file up into multiple maps, but instead just treat it as 1 map per .tgz file. Was this a recent change or enhancement? I'm noticing that it is breaking up the .tgz file into multiple maps. Thanks, Ryan
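One workaround for the archives: repack the tarball as a bare .gz on the way into HDFS, assuming the archive holds a single large text file (file names below are made up). One related caveat I'm fairly confident of: gzip streams aren't splittable, so each .gz input file becomes exactly one map.

  # Extract the file contents from the tarball and re-gzip straight into HDFS.
  tar -xzOf logs.tgz | gzip | bin/hadoop fs -put - /user/ryan/logs.gz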
Re: Question regarding reduce tasks
What happens when the reducer task gets invoked more than once? My guess is that once a reduce task finishes writing the data for a particular key to HDFS, it won't somehow get re-executed for the same key, right? On Mon, Nov 3, 2008 at 11:28 AM, Miles Osborne [EMAIL PROTECTED] wrote: You can't guarantee that a reducer (or mapper, for that matter) will be executed exactly once unless you turn off speculative execution. But a distinct key gets sent to a single reducer, so yes, only one reducer will see a particular key + associated values. Miles 2008/11/3 Ryan LeCompte [EMAIL PROTECTED]: Hello, Is it safe to assume that only one reduce task will ever operate on values for a particular key? Or is it possible that more than one reduce task can work on values for the same key? The reason I ask is because I want to ensure that a piece of code that I write at the end of my reducer method will only ever be executed once after all values for a particular key are aggregated/summed. Thanks, Ryan -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
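For anyone needing the knob Miles mentions: speculative execution is configurable per job. A minimal sketch against the old JobConf API — I believe setSpeculativeExecution() exists in this era's API, but verify against your version (MyJob is a placeholder for your job's main class):

  // Turn off speculative (duplicate) task attempts for this job.
  JobConf conf = new JobConf(MyJob.class);
  conf.setSpeculativeExecution(false);  // i.e. mapred.speculative.execution = false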
Re: NotYetReplicated exceptions when pushing large files into HDFS
Thanks. Is there a way to increase the retry count? Ryan On Mon, Sep 22, 2008 at 8:21 PM, lohit [EMAIL PROTECTED] wrote: Yes, these are warnings unless they fail three times, in which case your dfs -put command would fail with a stack trace. Thanks, Lohit - Original Message From: Ryan LeCompte [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Monday, September 22, 2008 5:18:01 PM Subject: Re: NotYetReplicated exceptions when pushing large files into HDFS I've noticed that although I get a few of these exceptions, the file is ultimately uploaded to the HDFS cluster. Does this mean that my file ended up getting there in one piece? The exceptions are just logged at the WARN level and indicate retry attempts. Thanks, Ryan On Mon, Sep 22, 2008 at 11:08 AM, Ryan LeCompte [EMAIL PROTECTED] wrote: Hello all, I'd love to be able to upload very large files (e.g., 8 or 10 GB) into HDFS, but it seems like my only option is to chop the file up into smaller pieces. Otherwise, after a while I get NotYetReplicated exceptions while the transfer is in progress. I'm using 0.18.1. Is there any way I can do this? Perhaps use something else besides bin/hadoop fs -put input output? Thanks, Ryan
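For the archives: the retry count appears to be controlled by dfs.client.block.write.retries (present in this era's hadoop-default.xml with a default of 3 — worth verifying against your version), and can be raised in hadoop-site.xml:

  <property>
    <name>dfs.client.block.write.retries</name>
    <value>10</value>
    <description>Retries for writing a block to the DataNodes before the
    client signals failure; the shipped default is 3.</description>
  </property>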
Re: NotYetReplicated exceptions when pushing large files into HDFS
I've noticed that although I get a few of these exceptions, the file is ultimately uploaded to the HDFS cluster. Does this mean that my file ended up getting there in one piece? The exceptions are just logged at the WARN level and indicate retry attempts. Thanks, Ryan On Mon, Sep 22, 2008 at 11:08 AM, Ryan LeCompte [EMAIL PROTECTED] wrote: Hello all, I'd love to be able to upload very large files (e.g., 8 or 10 GB) into HDFS, but it seems like my only option is to chop the file up into smaller pieces. Otherwise, after a while I get NotYetReplicated exceptions while the transfer is in progress. I'm using 0.18.1. Is there any way I can do this? Perhaps use something else besides bin/hadoop fs -put input output? Thanks, Ryan
Reduce tasks running out of memory on small hadoop cluster
Hello all, I'm setting up a small 3-node Hadoop cluster (1 node for the NameNode/JobTracker and the other two for DataNodes/TaskTrackers). The map tasks finish fine, but the reduce tasks fail at about 30% with an out-of-memory error. My guess is that the amount of data I'm crunching through just won't fit in memory during the reduce tasks on two machines (max of 2 reduce tasks on each machine). Is this expected? If I had a larger Hadoop cluster, I could increase the number of reduce tasks across the cluster so that not all of the data is processed by just 4 JVMs on two machines as I currently have it set up, correct? Is there any way to get the reduce task to not try to hold all of the data in memory, or is my only option to add more nodes to the cluster and thereby increase the number of reduce tasks? Thanks! Ryan
Re: Reduce tasks running out of memory on small hadoop cluster
Yes I did, but that didn't solve my problem since I'm working with a fairly large data set (8 GB). Thanks, Ryan On Sep 21, 2008, at 12:22 AM, Sandy [EMAIL PROTECTED] wrote: Have you increased the heap size in conf/hadoop-env.sh to 2000? This helped me some, but eventually I had to upgrade to a system with more memory. -SM On Sat, Sep 20, 2008 at 9:07 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: Hello all, I'm setting up a small 3-node Hadoop cluster (1 node for the NameNode/JobTracker and the other two for DataNodes/TaskTrackers). The map tasks finish fine, but the reduce tasks fail at about 30% with an out-of-memory error. My guess is that the amount of data I'm crunching through just won't fit in memory during the reduce tasks on two machines (max of 2 reduce tasks on each machine). Is this expected? If I had a larger Hadoop cluster, I could increase the number of reduce tasks across the cluster so that not all of the data is processed by just 4 JVMs on two machines as I currently have it set up, correct? Is there any way to get the reduce task to not try to hold all of the data in memory, or is my only option to add more nodes to the cluster and thereby increase the number of reduce tasks? Thanks! Ryan
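One note for the archives: HADOOP_HEAPSIZE in conf/hadoop-env.sh sizes the Hadoop daemons, not the task JVMs. The reduce tasks' heap comes from mapred.child.java.opts (default -Xmx200m in this era), so raising that is the more targeted fix — a sketch:

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
    <description>JVM options for each map/reduce task child process.
    Raising -Xmx here gives the reduce-side sort/merge more headroom.</description>
  </property>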
Re: Why can't Hadoop be used for online applications ?
Hadoop is best suited for distributed processing of large data sets across many machines. Most people use Hadoop to plow through large data sets in an offline fashion. One approach you can use is to process your data with Hadoop and then put it in an optimized form in HBase (which is similar to Google's Bigtable). Then you can use HBase for querying the data in an online-access fashion. Refer to http://hadoop.apache.org/hbase/ for more information about HBase. Ryan On Fri, Sep 12, 2008 at 2:46 PM, souravm [EMAIL PROTECTED] wrote: Hi, Here is a basic doubt. I found it mentioned in different documentation that Hadoop is not recommended for online applications. Can anyone please elaborate on the same? Regards, Sourav
Re: Why can't Hadoop be used for online applications ?
Hey Camilo, HBase is not meant to be a replacement for MySQL or a traditional RDBMS (HBase is not transactional, for instance). I'd recommend reading the following article that describes what HBase/Bigtable really is: http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable Thanks, Ryan On Fri, Sep 12, 2008 at 3:25 PM, Camilo Gonzalez [EMAIL PROTECTED] wrote: Hi Ryan! Does this mean that HBase could be used for online applications, for example, replacing MySQL in database-driven applications? Does anyone have any kind of benchmarks comparing MySQL queries/updates and HBase queries/updates? Have a nice day, Camilo. On Fri, Sep 12, 2008 at 1:55 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: Hadoop is best suited for distributed processing of large data sets across many machines. Most people use Hadoop to plow through large data sets in an offline fashion. One approach you can use is to process your data with Hadoop and then put it in an optimized form in HBase (which is similar to Google's Bigtable). Then you can use HBase for querying the data in an online-access fashion. Refer to http://hadoop.apache.org/hbase/ for more information about HBase. Ryan On Fri, Sep 12, 2008 at 2:46 PM, souravm [EMAIL PROTECTED] wrote: Hi, Here is a basic doubt. I found it mentioned in different documentation that Hadoop is not recommended for online applications. Can anyone please elaborate on the same? Regards, Sourav
Re: Issue in reduce phase with SortedMapWritable and custom Writables as values
Okay, I think I'm getting closer but now I'm running into another problem. First off, I created my own CustomMapWritable that extends MapWritable and invokes AbstractMapWritable.addToMap() to add my custom classes. Now the map/reduce phases actually complete and the job as a whole completes. However, when I try to use the SequenceFile API to later read the output data, I'm getting a strange exception. First the code:

  FileSystem fileSys = FileSystem.get(conf);
  SequenceFile.Reader reader = new SequenceFile.Reader(fileSys, inFile, conf);
  Text key = new Text();
  CustomWritable stats = new CustomWritable();
  reader.next(key, stats);
  reader.close();

And now the exception that's thrown:

  java.io.IOException: can't find class: com.test.CustomStatsWritable because com.test.CustomStatsWritable
    at org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:210)
    at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:145)
    at com.test.CustomStatsWritable.readFields(UserStatsWritable.java:49)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879)
    ...

Any ideas? Thanks, Ryan On Tue, Sep 9, 2008 at 12:36 AM, Ryan LeCompte [EMAIL PROTECTED] wrote: Hello, I'm attempting to use a SortedMapWritable with a LongWritable as the key and a custom implementation of org.apache.hadoop.io.Writable as the value. I notice that my program works fine when I use another primitive wrapper (e.g. Text) as the value, but fails with the following exception when I use my custom Writable instance:

  2008-09-08 23:25:02,072 INFO org.apache.hadoop.mapred.ReduceTask: Initiating in-memory merge with 1 segments...
  2008-09-08 23:25:02,077 INFO org.apache.hadoop.mapred.Merger: Merging 1 sorted segments
  2008-09-08 23:25:02,077 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 5492 bytes
  2008-09-08 23:25:02,099 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200809082247_0005_r_00_0 Merge of the inmemory files threw an exception:
  java.io.IOException: Intermedate merge failed
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2133)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2064)
  Caused by: java.lang.RuntimeException: java.lang.NullPointerException
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:80)
    at org.apache.hadoop.io.SortedMapWritable.readFields(SortedMapWritable.java:179)
    ...

I noticed that the AbstractMapWritable class has a protected addToMap(Class clazz) method. Do I somehow need to let my SortedMapWritable instance know about my custom Writable value? I've properly implemented the custom Writable object (it just contains a few primitives, like longs and ints). Any insight is appreciated. Thanks, Ryan
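For reference, a minimal sketch of the subclass described above — CustomStatsWritable is a stand-in for your own value class:

  import org.apache.hadoop.io.MapWritable;

  // Pre-registers the custom value class with the class/id maps inherited
  // from AbstractMapWritable, so readFields() can resolve it when
  // deserializing. Sketch only; adapt the class names to your own types.
  public class CustomMapWritable extends MapWritable {
    public CustomMapWritable() {
      addToMap(CustomStatsWritable.class);  // protected AbstractMapWritable method
    }
  }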
Re: Issue in reduce phase with SortedMapWritable and custom Writables as values
Based on some similar problems that I found others were having in the mailing lists, it looks like the solution was to list my Map/Reduce job JAR in the conf/hadoop-env.sh file under HADOOP_CLASSPATH. After doing that and re-submitting the job, it all worked fine! I guess the MapWritable class somehow doesn't share the same classpath as the program that actually submits the job conf. Is this expected? Thanks, Ryan On Tue, Sep 9, 2008 at 9:44 AM, Ryan LeCompte [EMAIL PROTECTED] wrote: Okay, I think I'm getting closer but now I'm running into another problem. First off, I created my own CustomMapWritable that extends MapWritable and invokes AbstractMapWritable.addToMap() to add my custom classes. Now the map/reduce phases actually complete and the job as a whole completes. However, when I try to use the SequenceFile API to later read the output data, I'm getting a strange exception. First the code:

  FileSystem fileSys = FileSystem.get(conf);
  SequenceFile.Reader reader = new SequenceFile.Reader(fileSys, inFile, conf);
  Text key = new Text();
  CustomWritable stats = new CustomWritable();
  reader.next(key, stats);
  reader.close();

And now the exception that's thrown:

  java.io.IOException: can't find class: com.test.CustomStatsWritable because com.test.CustomStatsWritable
    at org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:210)
    at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:145)
    at com.test.CustomStatsWritable.readFields(UserStatsWritable.java:49)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879)
    ...

Any ideas? Thanks, Ryan On Tue, Sep 9, 2008 at 12:36 AM, Ryan LeCompte [EMAIL PROTECTED] wrote: Hello, I'm attempting to use a SortedMapWritable with a LongWritable as the key and a custom implementation of org.apache.hadoop.io.Writable as the value. I notice that my program works fine when I use another primitive wrapper (e.g. Text) as the value, but fails with the following exception when I use my custom Writable instance:

  2008-09-08 23:25:02,072 INFO org.apache.hadoop.mapred.ReduceTask: Initiating in-memory merge with 1 segments...
  2008-09-08 23:25:02,077 INFO org.apache.hadoop.mapred.Merger: Merging 1 sorted segments
  2008-09-08 23:25:02,077 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 5492 bytes
  2008-09-08 23:25:02,099 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200809082247_0005_r_00_0 Merge of the inmemory files threw an exception:
  java.io.IOException: Intermedate merge failed
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2133)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2064)
  Caused by: java.lang.RuntimeException: java.lang.NullPointerException
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:80)
    at org.apache.hadoop.io.SortedMapWritable.readFields(SortedMapWritable.java:179)
    ...

I noticed that the AbstractMapWritable class has a protected addToMap(Class clazz) method. Do I somehow need to let my SortedMapWritable instance know about my custom Writable value? I've properly implemented the custom Writable object (it just contains a few primitives, like longs and ints). Any insight is appreciated. Thanks, Ryan
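Concretely, the fix amounts to one line in conf/hadoop-env.sh (the JAR path is a placeholder):

  # conf/hadoop-env.sh
  export HADOOP_CLASSPATH=/path/to/my-mapreduce-job.jar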
Issue in reduce phase with SortedMapWritable and custom Writables as values
Hello, I'm attempting to use a SortedMapWritable with a LongWritable as the key and a custom implementation of org.apache.hadoop.io.Writable as the value. I notice that my program works fine when I use another primitive wrapper (e.g. Text) as the value, but fails with the following exception when I use my custom Writable instance:

  2008-09-08 23:25:02,072 INFO org.apache.hadoop.mapred.ReduceTask: Initiating in-memory merge with 1 segments...
  2008-09-08 23:25:02,077 INFO org.apache.hadoop.mapred.Merger: Merging 1 sorted segments
  2008-09-08 23:25:02,077 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 5492 bytes
  2008-09-08 23:25:02,099 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200809082247_0005_r_00_0 Merge of the inmemory files threw an exception:
  java.io.IOException: Intermedate merge failed
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2133)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2064)
  Caused by: java.lang.RuntimeException: java.lang.NullPointerException
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:80)
    at org.apache.hadoop.io.SortedMapWritable.readFields(SortedMapWritable.java:179)
    ...

I noticed that the AbstractMapWritable class has a protected addToMap(Class clazz) method. Do I somehow need to let my SortedMapWritable instance know about my custom Writable value? I've properly implemented the custom Writable object (it just contains a few primitives, like longs and ints). Any insight is appreciated. Thanks, Ryan
Re: Multiple input files
Hi Sayali, Yes, you can submit a collection of files from HDFS as input to the job. Please take a look at the WordCount example in the Map/Reduce tutorial: http://hadoop.apache.org/core/docs/r0.18.0/mapred_tutorial.html#Example%3A+WordCount+v1.0 Ryan On Sat, Sep 6, 2008 at 9:03 AM, Sayali Kulkarni [EMAIL PROTECTED] wrote: Hello, When starting a Hadoop job, I need to specify an input file and an output file. Can I instead specify a list of input files? For example, I have the input distributed across the files file000, file001, file002, file003, ... So can I specify the input files as file*? I can add all my files to HDFS. Thanks in advance! --Sayali
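Concretely, with the 0.18-era API it looks like the sketch below; input paths may also be glob patterns, which covers the file* case directly (paths are made up):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;

  // Several explicit input paths...
  FileInputFormat.setInputPaths(conf, new Path("/data/file000"), new Path("/data/file001"));
  // ...or one glob pattern covering all of them:
  FileInputFormat.setInputPaths(conf, "/data/file*");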
Multiple output files
Hello, I have a question regarding the multiple output files that get produced when a job uses multiple reduce tasks (as opposed to only one). If I'm using a custom Writable and thus writing to a sequence file output, am I guaranteed that all of the data for a particular key will appear in a single output file (e.g., part-00000), or is it possible that the values could be split across multiple part-* files? At the end of the job I'm using the sequence file reader to read each custom key/writable pair from each output file. Is it possible that the same key could appear in multiple output files? If so, does Hadoop automatically grab all of the values for a particular key across all of the output files? Thanks, Ryan
Re: Multiple output files
This clears up my concerns. Thanks! Ryan On Sep 6, 2008, at 2:17 PM, Owen O'Malley [EMAIL PROTECTED] wrote: On Sep 6, 2008, at 9:35 AM, Ryan LeCompte wrote: I have a question regarding the multiple output files that get produced when a job uses multiple reduce tasks (as opposed to only one). If I'm using a custom Writable and thus writing to a sequence file output, am I guaranteed that all of the data for a particular key will appear in a single output file (e.g., part-00000), or is it possible that the values could be split across multiple part-* files? Each key will be processed by exactly one reduce. All of the keys to each reduce will be sorted. The application can define a Partitioner that picks the reduce for each key. The default one uses key.hashCode() % numReduces, which is usually balanced. If your key had both a date and a time and you wanted to have all of the transactions for a given day in the same reduce, you could do:

  class MyKey {
    Date date;
    Time time;
  }

and use a partitioner like:

  public class MyPartitioner implements Partitioner<MyKey, MyValue> {
    public int getPartition(MyKey key, MyValue value, int numReduceTasks) {
      return (key.date.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
  }

Of course the risk is that you may have very unbalanced reduce sizes, depending on your data. At the end of the job I'm using the sequence file reader to read each custom key/writable pair from each output file. Is it possible that the same key could appear in multiple output files? No. -- Owen
Custom Writables
Hello, Can a custom Writable object used as a key/value contain other Writables, like MapWritable? Thanks, Ryan
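For the archives: yes — a Writable can contain other Writables, as long as write() and readFields() visit the nested fields in the same order. A minimal sketch (class and field names are made up):

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.MapWritable;
  import org.apache.hadoop.io.Writable;

  // A custom Writable whose payload includes a nested MapWritable.
  public class CompositeWritable implements Writable {
    private LongWritable count = new LongWritable();
    private MapWritable attributes = new MapWritable();

    public void write(DataOutput out) throws IOException {
      count.write(out);        // serialize the fields...
      attributes.write(out);
    }

    public void readFields(DataInput in) throws IOException {
      count.readFields(in);    // ...and deserialize them in the same order
      attributes.readFields(in);
    }
  }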
Hadoop + Elastic Block Stores
Hello, I was wondering if anyone has gotten far at all with getting Hadoop up and running with EC2 + EBS? Any luck getting this to work in a way that the HDFS runs on the EBS so that it isn't blown away every time you bring up/down the EC2 Hadoop cluster? I'd like to experiment with this next, and was curious if anyone had any luck. :) Thanks! Ryan
Re: Hadoop + Elastic Block Stores
Good to know that you got it up and running. I'd really love to one day see some scripts under src/contrib/ec2/bin that can set up/mount the EBS volumes automatically. :-) On Sep 5, 2008, at 11:38 PM, Jean-Daniel Cryans [EMAIL PROTECTED] wrote: Ryan, I currently have a Hadoop/HBase setup that uses EBS. It works, but using EBS implies an additional overhead of configuration (too bad you can't spawn instances with volumes already attached to them, though I'm sure that'll come). Shutting down instances and bringing others up also requires more micro-management, but I think Tom White wrote about it and there was a link to it in another discussion you were part of. Hope this helps, J-D On Fri, Sep 5, 2008 at 7:00 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: Hello, I was wondering if anyone has gotten far at all with getting Hadoop up and running with EC2 + EBS? Any luck getting this to work in a way that the HDFS runs on the EBS so that it isn't blown away every time you bring up/down the EC2 Hadoop cluster? I'd like to experiment with this next, and was curious if anyone had any luck. :) Thanks! Ryan
Re: Hadoop EC2
Hi Tom, This clears up my questions. Thanks! Ryan On Thu, Sep 4, 2008 at 9:21 AM, Tom White [EMAIL PROTECTED] wrote: On Thu, Sep 4, 2008 at 1:46 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: I'm noticing that bin/hadoop fs -put ... s3://... uploads multi-gigabyte files in ~64MB chunks. That's because S3FileSystem stores files as 64MB blocks on S3. Then this is copied from S3 into HDFS using bin/hadoop distcp. Once the files are there and the job begins, it looks like it's breaking up the 4 multi-gigabyte text files into about 225 maps. Does this mean that each map is roughly processing 64MB of data? Yes, HDFS stores files as 64MB blocks too, and map input is split by default so each map processes one block. If so, is there any way to change this so that I can get my map tasks to process more data at a time? I'm curious if this will shorten the time it takes to run the program. You could try increasing the HDFS block size. 128MB is actually usually a better value, for this very reason. In the future https://issues.apache.org/jira/browse/HADOOP-2560 will help here too. Tom, in your article about Hadoop + EC2 you mention processing about 100GB of logs in under 6 minutes or so. In this article: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873, it took 35 minutes to run the job. I'm planning on doing some benchmarking on EC2 fairly soon, which should help us improve the performance of Hadoop on EC2. It's worth remarking that this was running on small instances. The larger instances perform a lot better in my experience. Do you remember how many EC2 instances you had running, and also how many map tasks did you have to operate on the 100GB? Was each map task handling about 1GB each? I was running 20 nodes, and each map task was handling an HDFS block, 64MB. Hope this helps, Tom
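For reference, Tom's block-size suggestion is one property in hadoop-site.xml — a sketch; note it only affects files written after the change:

  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
    <description>HDFS block size in bytes; 134217728 = 128 MB.
    Applies to newly written files only.</description>
  </property>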
Re: EC2 AMI for Hadoop 0.18.0
Works great! My only suggestion would be to modify the /usr/local/hadoop-0.18.0/conf/hadoop-site.xml file to use hdfs://... for the namenode address. Otherwise I constantly get warnings saying that the syntax is deprecated any time I submit a job for execution or interact with HDFS via bin/hadoop fs -ls ... Thanks, Ryan On Thu, Sep 4, 2008 at 10:25 AM, Stuart Sierra [EMAIL PROTECTED] wrote: Thanks, Tom! -Stuart On Wed, Sep 3, 2008 at 12:43 PM, Tom White [EMAIL PROTECTED] wrote: I've just created public AMIs for 0.18.0. Note that they are in the hadoop-images bucket. Tom On Fri, Aug 29, 2008 at 9:22 PM, Karl Anderson [EMAIL PROTECTED] wrote: On 29-Aug-08, at 6:49 AM, Stuart Sierra wrote: Anybody have one? Any success building it with create-hadoop-image? Thanks, -Stuart I was able to build one following the instructions in the wiki. You'll need to find the Java download url (see wiki) and put it and your own S3 bucket name in hadoop-ec2-env.sh.
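For concreteness, a sketch of the change — the host and port are placeholders for whatever the AMI's startup scripts configured:

  <!-- conf/hadoop-site.xml: use the hdfs:// URI syntax for the NameNode -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://your-namenode-host:50001</value>
  </property>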
Re: Hadoop EC2
Tom, I noticed that you mentioned using Amazon's new elastic block store as an alternative to using S3. Right now I'm testing pushing data to S3, then moving it from S3 into HDFS once the Hadoop cluster is up and running in EC2. It works pretty well -- moving data from S3 to HDFS is fast when the data in S3 is broken up into multiple files, since bin/hadoop distcp uses a Map/Reduce job to efficiently transfer the data. Are there any real advantages to using the new elastic block store? Is moving data from the elastic block store into HDFS any faster than doing it from S3? Or can HDFS essentially live inside of the elastic block store? Thanks! Ryan On Wed, Sep 3, 2008 at 9:54 AM, Tom White [EMAIL PROTECTED] wrote: There's a case study with some numbers in it from a presentation I gave on Hadoop and AWS in London last month, which you may find interesting: http://skillsmatter.com/custom/presentations/ec2-talk.pdf. tim robertson [EMAIL PROTECTED] wrote: For these small datasets, you might find it useful - let me know if I should spend time finishing it (Or submit help?) - it is really very simple. This sounds very useful. Please consider creating a Jira and submitting the code (even if it's not finished folks might like to see it). Thanks. Tom Cheers Tim On Tue, Sep 2, 2008 at 2:22 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: Hi Tim, Are you mostly just processing/parsing textual log files? How many maps/reduces did you configure in your hadoop-ec2-env.sh file? How many did you configure in your JobConf? Just trying to get an idea of what to expect in terms of performance. I'm noticing that it takes about 16 minutes to transfer about 15GB of textual uncompressed data from S3 into HDFS after the cluster has started with 15 nodes. I was expecting this to take a shorter amount of time, but maybe I'm incorrect in my assumptions. I am also noticing that it takes about 15 minutes to parse through the 15GB of data with a 15 node cluster. Thanks, Ryan On Tue, Sep 2, 2008 at 3:29 AM, tim robertson [EMAIL PROTECTED] wrote: I have been processing only 100s GBs on EC2, not 1000's and using 20 nodes and really only in exploration and testing phase right now. On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock [EMAIL PROTECTED] wrote: Hi Ryan, Just a heads up, if you require more than the 20 node limit, Amazon provides a form to request a higher limit: http://www.amazon.com/gp/html-forms-controller/ec2-request Andrew On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: Hello all, I'm curious to see how many people are using EC2 to execute their Hadoop cluster and map/reduce programs, and how many are using home-grown datacenters. It seems like the 20 node limit with EC2 is a bit crippling when one wants to process many gigabytes of data. Has anyone found this to be the case? How much data are people processing with their 20 node limit on EC2? Curious what the thoughts are... Thanks, Ryan
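For reference, the S3-to-HDFS copy described above looks roughly like the sketch below; the bucket name and NameNode address are placeholders, and the AWS credentials can live in hadoop-site.xml (fs.s3.awsAccessKeyId / fs.s3.awsSecretAccessKey) rather than in the URI:

  # Parallel (Map/Reduce-driven) copy from the S3 block store into HDFS.
  bin/hadoop distcp s3://my-bucket/logs hdfs://your-namenode-host:50001/logs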
Re: Hadoop EC2
Hi Tim, Are you mostly just processing/parsing textual log files? How many maps/reduces did you configure in your hadoop-ec2-env.sh file? How many did you configure in your JobConf? Just trying to get an idea of what to expect in terms of performance. I'm noticing that it takes about 16 minutes to transfer about 15GB of textual uncompressed data from S3 into HDFS after the cluster has started with 15 nodes. I was expecting this to take a shorter amount of time, but maybe I'm incorrect in my assumptions. I am also noticing that it takes about 15 minutes to parse through the 15GB of data with a 15 node cluster. Thanks, Ryan On Tue, Sep 2, 2008 at 3:29 AM, tim robertson [EMAIL PROTECTED] wrote: I have been processing only 100s GBs on EC2, not 1000's and using 20 nodes and really only in exploration and testing phase right now. On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock [EMAIL PROTECTED] wrote: Hi Ryan, Just a heads up, if you require more than the 20 node limit, Amazon provides a form to request a higher limit: http://www.amazon.com/gp/html-forms-controller/ec2-request Andrew On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: Hello all, I'm curious to see how many people are using EC2 to execute their Hadoop cluster and map/reduce programs, and how many are using home-grown datacenters. It seems like the 20 node limit with EC2 is a bit crippling when one wants to process many gigabytes of data. Has anyone found this to be the case? How much data are people processing with their 20 node limit on EC2? Curious what the thoughts are... Thanks, Ryan
Re: Hadoop EC2
Hi Tim, Thanks for responding -- I believe that I'll need the full power of Hadoop since I'll want this to scale well beyond 100GB of data. Thanks for sharing your experiences -- I'll definitely check out your blog. Thanks! Ryan On Tue, Sep 2, 2008 at 8:47 AM, tim robertson [EMAIL PROTECTED] wrote: Hi Ryan, I actually blogged my experience as it was my first usage of EC2: http://biodivertido.blogspot.com/2008/06/hadoop-on-amazon-ec2-to-generate.html My input data was not log files but actually a dump of 150 million records from MySQL into about 13 columns of tab-delimited data, I believe. It was a couple of months ago, but I remember thinking S3 was very slow... I ran some simple operations like distinct values of one column based on another (species within a cell) and also did some polygon analysis, since "is this point in this polygon" does not really scale too well in PostGIS. Incidentally, I have most of the basics of a MapReduce-Lite which I aim to port to use the exact Hadoop API, since I am *only* working on 10's-100's of GB of data and find that it is running really fine on my laptop, and I don't need the distributed failover. My goal for that code is for people like me who want to know that they can scale to terabyte processing but don't need to take the plunge to a full Hadoop deployment yet, knowing that the processing can migrate in the future as things grow. It runs on the normal filesystem, and single node only (e.g. multithreaded), and performs very quickly since it is just doing Java NIO byte buffers in parallel on the underlying filesystem - on my laptop I Map+Sort+Combine about 130,000 jobs a second (simplest of simple map operations). For these small datasets, you might find it useful - let me know if I should spend time finishing it (or submit help?) - it is really very simple. Cheers Tim On Tue, Sep 2, 2008 at 2:22 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: Hi Tim, Are you mostly just processing/parsing textual log files? How many maps/reduces did you configure in your hadoop-ec2-env.sh file? How many did you configure in your JobConf? Just trying to get an idea of what to expect in terms of performance. I'm noticing that it takes about 16 minutes to transfer about 15GB of textual uncompressed data from S3 into HDFS after the cluster has started with 15 nodes. I was expecting this to take a shorter amount of time, but maybe I'm incorrect in my assumptions. I am also noticing that it takes about 15 minutes to parse through the 15GB of data with a 15 node cluster. Thanks, Ryan On Tue, Sep 2, 2008 at 3:29 AM, tim robertson [EMAIL PROTECTED] wrote: I have been processing only 100s GBs on EC2, not 1000's and using 20 nodes and really only in exploration and testing phase right now. On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock [EMAIL PROTECTED] wrote: Hi Ryan, Just a heads up, if you require more than the 20 node limit, Amazon provides a form to request a higher limit: http://www.amazon.com/gp/html-forms-controller/ec2-request Andrew On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: Hello all, I'm curious to see how many people are using EC2 to execute their Hadoop cluster and map/reduce programs, and how many are using home-grown datacenters. It seems like the 20 node limit with EC2 is a bit crippling when one wants to process many gigabytes of data. Has anyone found this to be the case? How much data are people processing with their 20 node limit on EC2? Curious what the thoughts are... Thanks, Ryan
Re: Error while uploading large file to S3 via Hadoop 0.18
Actually no - not if you're using the block-based s3:// filesystem as opposed to s3n://, since s3:// stores a file as many smaller blocks rather than as a single S3 object. Thanks, Ryan On Tue, Sep 2, 2008 at 11:21 AM, James Moore [EMAIL PROTECTED] wrote: On Mon, Sep 1, 2008 at 1:32 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: Hello, I'm trying to upload a fairly large file (18GB or so) to my AWS S3 account via bin/hadoop fs -put ... s3://... Isn't the maximum size of a file on s3 5GB? -- James Moore | [EMAIL PROTECTED] Ruby and Ruby on Rails consulting blog.restphone.com
Re: Hadoop EC2
How can you ensure that the S3 buckets and EC2 instances belong to a certain zone? Ryan On Tue, Sep 2, 2008 at 2:38 PM, Karl Anderson [EMAIL PROTECTED] wrote: On 2-Sep-08, at 5:22 AM, Ryan LeCompte wrote: Hi Tim, Are you mostly just processing/parsing textual log files? How many maps/reduces did you configure in your hadoop-ec2-env.sh file? How many did you configure in your JobConf? Just trying to get an idea of what to expect in terms of performance. I'm noticing that it takes about 16 minutes to transfer about 15GB of textual uncompressed data from S3 into HDFS after the cluster has started with 15 nodes. I was expecting this to take a shorter amount of time, but maybe I'm incorrect in my assumptions. I am also noticing that it takes about 15 minutes to parse through the 15GB of data with a 15 node cluster. I'm seeing much faster speeds. With 128 nodes running a mapper-only downloading job, downloading 30 GB takes roughly a minute, less time than the end of job work (which I assume is HDFS replication and bookkeeping). More mappers gives you more parallel downloads, of course. I'm using a Python REST client for S3, and only move data to or from S3 when Hadoop is done with it. Make sure your S3 buckets and EC2 instances are in the same zone.
Re: JVM Spawning
I see... so there really isn't a way for me to test a map/reduce program on a single node without incurring the overhead of starting and stopping JVMs. My input is broken up into 5 text files; is there a way I could start the job such that it only uses 1 map to process the whole thing? I guess I'd have to concatenate the files into 1 file and somehow turn off splitting? Ryan On Wed, Sep 3, 2008 at 12:09 AM, Owen O'Malley [EMAIL PROTECTED] wrote: On Sep 2, 2008, at 9:00 PM, Ryan LeCompte wrote: Beginner's question: if I have a cluster with a single node that has a max of 1 map/1 reduce, and the job submitted has 50 maps... then it will process only 1 map at a time. Does that mean that it's spawning 1 new JVM for each map processed? Or re-using the same JVM when a new map can be processed? It creates a new JVM for each task. Devaraj is working on https://issues.apache.org/jira/browse/HADOOP-249 which will allow the JVMs to run multiple tasks sequentially. -- Owen
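Concatenation isn't strictly necessary for the splitting half: splitting can be turned off in the input format itself, as in the sketch below against the 0.18-era API (the class name is made up). Note that with five input files you'd still get five maps — one per file — so concatenating first is how to get down to a single map.

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.TextInputFormat;

  // An input format that refuses to split files: one map task per input file.
  public class NonSplittingTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
      return false;
    }
  }

  // In the job setup:
  //   conf.setInputFormat(NonSplittingTextInputFormat.class);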
Error while uploading large file to S3 via Hadoop 0.18
Hello, I'm trying to upload a fairly large file (18GB or so) to my AWS S3 account via bin/hadoop fs -put ... s3://... It copies for a good 15 or 20 minutes, and then eventually errors out with a failed retry attempt (saying that it can't retry since it has already written a certain number of bytes; sorry, I don't have the original error message at the moment). Has anyone experienced anything similar? Can anyone suggest a workaround or a way to specify retries? Should I use another tool for uploading large files to S3? Thanks, Ryan
Re: Error while uploading large file to S3 via Hadoop 0.18
Thanks, trying it now! Ryan On Mon, Sep 1, 2008 at 6:04 PM, Albert Chern [EMAIL PROTECTED] wrote: Increase the retry buffer size in jets3t.properties and maybe up the number of retries while you're at it. If there is no template file included in Hadoop's conf dir you can find it at the jets3t web site. Make sure that it's from the same version that your copy of Hadoop is using. On Mon, Sep 1, 2008 at 1:32 PM, Ryan LeCompte [EMAIL PROTECTED] wrote: Hello, I'm trying to upload a fairly large file (18GB or so) to my AWS S3 account via bin/hadoop fs -put ... s3://... It copies for a good 15 or 20 minutes, and then eventually errors out with a failed retry attempt (saying that it can't retry since it has already written a certain number of bytes, etc. sorry don't have the original error message at the moment). Has anyone experienced anything similar? Can anyone suggest a workaround or a way to specify retries? Should I use another tool for uploading large files to s3? Thanks, Ryan
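On the Hadoop side there is also a retry knob that I believe applies here — fs.s3.maxRetries in hadoop-site.xml (default 4 in this era; treat the exact property name as something to verify against your hadoop-default.xml):

  <property>
    <name>fs.s3.maxRetries</name>
    <value>10</value>
    <description>Maximum retries for reading or writing files to S3
    before signalling failure to the application.</description>
  </property>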
Hadoop EC2
Hello all, I'm curious to see how many people are using EC2 to execute their Hadoop cluster and map/reduce programs, and how many are using home-grown datacenters. It seems like the 20 node limit with EC2 is a bit crippling when one wants to process many gigabytes of data. Has anyone found this to be the case? How much data are people processing with their 20 node limit on EC2? Curious what the thoughts are... Thanks, Ryan
Reduce hanging with custom value objects?
Hello all, I'm new to Hadoop. I'm trying to write a small Hadoop map/reduce program that, instead of reading/writing the primitive LongWritable, IntWritable, etc. classes, uses a custom object that I wrote that implements the Writable interface. I'm still using LongWritable for the keys, but my CustomWritable for the values. My input is still LongWritable,Text because I'm parsing a raw log file. However, the output of the map is LongWritable,CustomWritable, the input to reduce is LongWritable,CustomWritable, and the output from reduce is LongWritable,CustomWritable. I'm noticing that the map part processes fine, but when it gets to reduce it just hangs at 0%. I'm not seeing any useful output in the logs either. Here's my job configuration:

  conf.setJobName("test");
  conf.setOutputKeyClass(LongWritable.class);
  conf.setOutputValueClass(CustomWritable.class);
  conf.setMapperClass(MapClass.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);
  FileInputFormat.setInputPaths(conf, remainingArguments.get(0));
  FileOutputFormat.setOutputPath(conf, new Path(remainingArguments.get(1)));

Am I missing something here? Do I need to specify the output format via conf.setOutputFormat? Thanks, Ryan
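Two settings worth checking in a setup like this, sketched below — both calls exist in the old JobConf API. The map output classes default to the job's output classes (so they already match here), but declaring them explicitly makes type mismatches fail fast; and a binary value class generally wants a SequenceFile output format rather than the default text output:

  // Declare the intermediate (map output) key/value types explicitly.
  conf.setMapOutputKeyClass(LongWritable.class);
  conf.setMapOutputValueClass(CustomWritable.class);
  // Store binary key/value pairs instead of toString()'d text output.
  conf.setOutputFormat(SequenceFileOutputFormat.class);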