Re: Hadoop streaming performance: elements vs. vectors
Hi Peter,

Thinking aloud on this - the trade-offs may depend on:

* how much grouping would be possible (tracking a PDF of that would be interesting for metrics)
* locality of key/value pairs (how they are distributed among mapper and reducer tasks)

To that point, will there be much time spent in the shuffle? If so, it's probably cheaper to shuffle/sort the grouped row vectors than the many small key,value pairs.

In any case, when I had a similar situation on a large data set (2-3 TB shuffle), a good pattern to follow was:

* mapper emitted small key,value pairs
* combiner grouped them into row vectors

That combiner may get invoked both at the end of the map phase and at the beginning of the reduce phase (more benefit). Also, using byte arrays where possible to represent values may save much shuffle time.

Best,
Paco

On Sat, Mar 28, 2009 at 01:51, Peter Skomoroch peter.skomor...@gmail.com wrote:

Hadoop streaming question: If I am forming a matrix M by summing a number of elements generated on different mappers, is it better to emit tons of lines from the mappers with small key,value pairs for each element, or should I group them into row vectors before sending to the reducers?

For example, say I'm summing frequency count matrices M for each user on a different map task, and the reducer combines the resulting sparse user count matrices for use in another calculation. Should I emit the individual elements:

i (j, Mij) \n
3 (1, 3.4) \n
3 (2, 3.4) \n
3 (3, 3.4) \n
4 (1, 2.3) \n
4 (2, 5.2) \n

Or posting-list style vectors?

3 ((1, 3.4), (2, 3.4), (3, 3.4)) \n
4 ((1, 2.3), (2, 5.2)) \n

Using vectors will at least save some message space, but are there any other benefits to this approach in terms of Hadoop streaming overhead (sorts etc.)? I think buffering issues will not be a huge concern since the length of the vectors has a reasonable upper bound and they will be in a sparse format...

--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch
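[For illustration, a rough sketch of that grouping step in Python for streaming - assuming mappers emit "row<TAB>col,value" lines, and running the grouping as a reduce (or as a combiner, where your streaming version supports script combiners). The field formats are placeholders, not Peter's actual layout:]

#!/usr/bin/env python
# group.py - collect small (row, "col,value") pairs into posting-list
# style row vectors. Streaming delivers reducer input sorted by key, so
# all cells for one row arrive contiguously.
import sys
from itertools import groupby

def parse(stdin):
    for line in stdin:
        row, _, cell = line.rstrip("\n").partition("\t")
        yield row, cell          # cell looks like "1,3.4"

for row, group in groupby(parse(sys.stdin), key=lambda kv: kv[0]):
    cells = ["(%s)" % cell for _, cell in group]
    # e.g. emits "3<TAB>(1,3.4) (2,3.4) (3,3.4)"
    print "%s\t%s" % (row, " ".join(cells))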
Re: EC2 Usage?
Ryan,

A developer on our team wrote some JSP to add to the JobTracker, so that job times and other stats could be accessed programmatically via web services: https://issues.apache.org/jira/browse/HADOOP-4559

There's another update coming for that patch in JIRA, to get task data.

Paco

On Thu, Dec 18, 2008 at 11:17, Tom White t...@cloudera.com wrote:
Hi Ryan,
The ec2-describe-instances command in the API tool reports the launch time for each instance, so you could work out the machine hours of your cluster using that information.
Tom

On Thu, Dec 18, 2008 at 4:59 PM, Ryan LeCompte lecom...@gmail.com wrote:
Hello all,
Somewhat of an off-topic question, but I know there are Hadoop + EC2 users here. Does anyone know if there is a programmatic API to find out how many machine hours have been used by a Hadoop cluster (or anything) running on EC2? I know that you can log into the EC2 web site and see this, but I'm wondering if there's a way to access this data programmatically via web services?
Thanks,
Ryan
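[Building on Tom's suggestion, a rough sketch using the boto library - it ignores partial-hour billing and already-terminated instances, so treat the result as an estimate only:]

#!/usr/bin/env python
# Estimate machine hours for currently running EC2 instances by summing
# the elapsed time since each instance's launch_time.
import boto
from datetime import datetime

conn = boto.connect_ec2()   # credentials come from the environment
now = datetime.utcnow()
total_hours = 0.0
for reservation in conn.get_all_instances():
    for inst in reservation.instances:
        if inst.state != "running":
            continue
        # launch_time is ISO-8601, e.g. "2008-12-18T16:59:00.000Z"
        launched = datetime.strptime(inst.launch_time[:19], "%Y-%m-%dT%H:%M:%S")
        delta = now - launched
        total_hours += delta.days * 24 + delta.seconds / 3600.0

print "approx machine hours: %.1f" % total_hours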
Re: Getting Reduce Output Bytes
Hi Lohit,

Our team collects those kinds of measurements using this patch: https://issues.apache.org/jira/browse/HADOOP-4559

Some example Java code in the comments shows how to access the data, which is serialized as JSON. Looks like the red_hdfs_bytes_written value would give you that.

Best,
Paco

On Tue, Nov 25, 2008 at 00:28, lohit [EMAIL PROTECTED] wrote:
Hello,
Is there an easy way to get Reduce Output Bytes?
Thanks,
Lohit
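[For illustration, a sketch of reading that value from the JSON in Python. The URL path and the flat key layout here are assumptions on my part - check the patch and the example code in its JIRA comments for the actual endpoint and format:]

#!/usr/bin/env python
# Fetch the JSON job stats exposed by the HADOOP-4559 patch and pull out
# one counter. The URL and key layout are assumptions, not from the patch.
import urllib
import simplejson   # the common JSON library for Python 2.4/2.5

url = "http://jobtracker:50030/jobstats.jsp?jobid=job_200811250001_0001"  # hypothetical
stats = simplejson.loads(urllib.urlopen(url).read())
print "reduce output bytes (HDFS):", stats["red_hdfs_bytes_written"]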
Re: combiner stats
Thank you, Devaraj - That explanation helps a lot. Is the following reasonable to say?

"Combine input records" count shown in the Map phase column of the report is a measure of how many times records have passed through the combiner during merges of intermediate spills. Therefore, it may be larger than the actual count of records which are being merged.

Paco

> On the map side, the combiner is called after sort and during the merges of the intermediate spills. At the end a single spill file is generated. Note that, during the merges, the same record may pass multiple times through the combiner.

On Mon, Nov 17, 2008 at 23:04, Devaraj Das [EMAIL PROTECTED] wrote:
On 11/18/08 3:59 AM, Paco NATHAN [EMAIL PROTECTED] wrote:

> Could someone please help explain the job counters shown for Combine records on the JobTracker JSP page? Here's an example from one of our MR jobs. There are Combine input and output record counters shown for both Map phase and Reduce phase. We're not quite sure how to interpret them -
>
> Map Phase:
> * Map input records: 85,013,261,279
> * Map output records: 85,013,261,279
> * Combine input records: 114,936,724,505
> * Combine output records: 38,750,511,975
>
> Reduce Phase:
> * Combine input records: 8,827,017,275
> * Combine output records: 17,986,654
> * Reduce input groups: 2,221,796
> * Reduce input records: 17,986,654
> * Reduce output records: 4,443,590
>
> What makes sense:
> * Considering the MR job and its data, the 85.0b count for Map output records is expected
> * I would believe a rate of 85.0b / 38.8b = 2.2 for our combiner
> * Reduce phase shows Combine output records at 18.0m = Reduce input records at 18.0m
> * Reduce input groups at 2.2m is expected
> * Reduce output records at 4.4m is verified
>
> What doesn't make sense:
> * The 115b count for Combine input records during Map phase
> * The 8.8b count for Combine input records during Reduce phase

On the map side, the combiner is called after sort and during the merges of the intermediate spills. At the end a single spill file is generated. Note that, during the merges, the same record may pass multiple times through the combiner. On the reducer side, the combiner would be called only during merges of intermediate data, and the intermediate merges stop at a certain point (when <= io.sort.factor files remain). Hence the combiner may be called fewer times here...

> What would be the actual count of records coming out of the Map phase?
>
> Thanks,
> Paco
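[Putting numbers to that explanation, using the map-phase counters above - this split is inferred from Devaraj's description; Hadoop does not report it directly:

  combine input records - map output records = records re-passed during merges
  114,936,724,505 - 85,013,261,279 = 29,923,463,226

So roughly 30b of the 115b combine inputs were second (or later) passes of the same records through the combiner during merges of intermediate spills, rather than distinct map outputs.]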
combiner stats
Could someone please help explain the job counters shown for Combine records on the JobTracker JSP page? Here's an example from one of our MR jobs. There are Combine input and output record counters shown for both Map phase and Reduce phase. We're not quite sure how to interpret them -

Map Phase:
* Map input records: 85,013,261,279
* Map output records: 85,013,261,279
* Combine input records: 114,936,724,505
* Combine output records: 38,750,511,975

Reduce Phase:
* Combine input records: 8,827,017,275
* Combine output records: 17,986,654
* Reduce input groups: 2,221,796
* Reduce input records: 17,986,654
* Reduce output records: 4,443,590

What makes sense:
* Considering the MR job and its data, the 85.0b count for Map output records is expected
* I would believe a rate of 85.0b / 38.8b = 2.2 for our combiner
* Reduce phase shows Combine output records at 18.0m = Reduce input records at 18.0m
* Reduce input groups at 2.2m is expected
* Reduce output records at 4.4m is verified

What doesn't make sense:
* The 115b count for Combine input records during Map phase
* The 8.8b count for Combine input records during Reduce phase

What would be the actual count of records coming out of the Map phase?

Thanks,
Paco
Re: Auto-shutdown for EC2 clusters
Hi Karl,

Rather than using separate key pairs, you can use EC2 security groups to keep track of different clusters. Effectively, that requires a new security group for every cluster -- so just allocate a bunch of different ones in a config file, then have the launch scripts draw from those.

We also use EC2 static IP addresses, and then have a DNS entry named similarly to each security group, associated with a static IP once that cluster is launched. It's relatively simple to query the running instances and collect them according to security groups.

One way to handle detecting failures is just to attempt SSH in a loop. Our rough estimate is that approximately 2% of the attempted EC2 nodes fail at launch. So we allocate more than enough, given that rate.

In a nutshell, that's one approach for managing a Hadoop cluster remotely on EC2.

Best,
Paco

On Fri, Oct 24, 2008 at 2:07 PM, Karl Anderson [EMAIL PROTECTED] wrote:

On 23-Oct-08, at 10:01 AM, Paco NATHAN wrote:
> This workflow could be initiated from a crontab -- totally automated. However, we still see occasional failures of the cluster, and must restart manually, but not often. Stability for that has improved much since the 0.18 release. For us, it's getting closer to total automation. FWIW, that's running on EC2 m1.xl instances.

Same here. I've always had the namenode and web interface be accessible, but sometimes I don't get the slave nodes - usually zero slaves when this happens, sometimes I only miss one or two. My rough estimate is that this happens 1% of the time. I currently have to notice this and restart manually. Do you have a good way to detect it?

I have several Hadoop clusters running at once with the same AWS image and SSH keypair, so I can't count running instances. I could have a separate keypair per cluster and count instances with that keypair, but I'd like to be able to start clusters opportunistically, with more than one cluster doing the same kind of job on different data.

Karl Anderson
[EMAIL PROTECTED]
http://monkey.org/~kra
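[A bare-bones version of that SSH-in-a-loop check, with hostnames, retry counts, and timeouts as placeholders:]

#!/usr/bin/env python
# Detect failed EC2 node launches by attempting SSH in a loop. Hosts that
# never answer within the retry budget count as failed, so the launch
# scripts can allocate replacements.
import subprocess
import time

def ssh_alive(host, retries=10, wait=30):
    for _ in range(retries):
        rc = subprocess.call(
            ["ssh", "-o", "ConnectTimeout=10",
             "-o", "StrictHostKeyChecking=no", host, "true"])
        if rc == 0:
            return True
        time.sleep(wait)
    return False

hosts = ["ec2-1.example.com", "ec2-2.example.com"]  # from describe-instances
failed = [h for h in hosts if not ssh_alive(h)]
print "failed launches:", failed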
Re: Auto-shutdown for EC2 clusters
Hi Stuart,

Yes, we do that. Ditto on most of what Chris described.

We use an AMI which pulls tarballs for Ant, Java, Hadoop, etc., from S3 when it launches. That controls the versions for tools/frameworks, instead of redoing an AMI each time a tool has an update.

A remote server -- in our data center -- acts as a controller, to launch and manage the cluster. FWIW, an engineer here wrote those scripts in Python using boto. We had to patch boto, and the patch was submitted upstream.

The mapper for the first MR job in the workflow streams in data from S3. Reducers in subsequent jobs have the option to write output to S3 (as Chris mentioned). After the last MR job in the workflow completes, it pushes a message into SQS. The remote server polls SQS, then performs a shutdown of the cluster.

We may replace the use of SQS with RabbitMQ -- more flexible for brokering other kinds of messages between Hadoop on AWS and the controller/consumer of results back in our data center.

This workflow could be initiated from a crontab -- totally automated. However, we still see occasional failures of the cluster, and must restart manually, but not often. Stability for that has improved much since the 0.18 release. For us, it's getting closer to total automation. FWIW, that's running on EC2 m1.xl instances.

Paco

On Thu, Oct 23, 2008 at 9:47 AM, Stuart Sierra [EMAIL PROTECTED] wrote:
Hi folks,
Anybody tried scripting Hadoop on EC2 to...
1. Launch a cluster
2. Pull data from S3
3. Run a job
4. Copy results to S3
5. Terminate the cluster
... without any user interaction?
-Stuart
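[For illustration, the controller-side polling loop might look like this with boto - the queue name and the shutdown hook are placeholders:]

#!/usr/bin/env python
# Poll SQS for the "workflow done" message pushed by the last MR job,
# then trigger cluster shutdown.
import time
import boto

conn = boto.connect_sqs()
queue = conn.get_queue("hadoop-workflow-done")   # hypothetical queue name

while True:
    msg = queue.read()
    if msg is not None:
        print "workflow finished:", msg.get_body()
        queue.delete_message(msg)
        break
    time.sleep(60)

# ...then shut down the cluster, e.g. via the hadoop-ec2 terminate script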
Re: Can jobs be configured to be sequential
Hi Ravion,

The problem you are describing sounds like a workflow, where you must be careful to verify certain conditions before proceeding to the next step. We have similar kinds of use cases for Hadoop apps at work, which are essentially ETL.

I recommend that you look at http://cascading.org as an abstraction layer for managing these kinds of workflows. We've found it quite useful.

Best,
Paco

On Fri, Oct 17, 2008 at 8:29 PM, Ravion [EMAIL PROTECTED] wrote:
Dear all,

We have about 600 ETL (Extract Transform Load) jobs in our Data Warehouse System to create an interim data model. Some jobs are dependent on the completion of others. Assume that I create groups of interdependent jobs: say a group G1 contains 100 jobs, G2 contains another 200 jobs which are dependent on completion of group G1, and so on.

Can we leverage Hadoop so that it executes G1 first; on failure it won't execute G2, otherwise it will continue with G2, and so on? Or do I need to configure N (where N = total number of groups) Hadoop jobs independently and handle this ourselves?

Please share your thoughts, thanks.

Warmest regards,
Ravion
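[Cascading expresses those dependencies directly, by connecting flows into a cascade. Just to illustrate the bare gating logic - run G2 only if everything in G1 succeeded - a driver could be as simple as this sketch, where the job commands are placeholders:]

#!/usr/bin/env python
# Run groups of Hadoop jobs in dependency order, stopping at the first
# group that fails.
import subprocess

groups = [
    ["hadoop jar etl.jar Step1", "hadoop jar etl.jar Step2"],  # G1
    ["hadoop jar etl.jar Step3"],                              # G2 depends on G1
]

for i, group in enumerate(groups):
    for cmd in group:
        if subprocess.call(cmd.split()) != 0:
            raise SystemExit("group G%d failed on: %s" % (i + 1, cmd))
    print "group G%d complete" % (i + 1)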
Re: Questions about Hadoop
Edward,

Can you describe more about Hama, with respect to Hadoop? I've read through the Incubator proposal and your blog -- it's a great approach.

Are there any benchmarks available? E.g., size of data sets used, kinds of operations performed, etc. Will this project be able to make use of existing libraries?

Best,
Paco

On Thu, Sep 25, 2008 at 9:31 PM, Edward J. Yoon [EMAIL PROTECTED] wrote:
The decision making system seems interesting to me. :)

> The question I want to ask is whether it is possible to perform statistical analysis on the data using Hadoop and MapReduce.

I'm sure Hadoop could do it. FYI, the Hama project is an easy-to-use matrix algebra framework for statistical analysis on Hadoop and HBase. (It is still in its early stage.)

/Edward
Re: Questions about Hadoop
Arijit,

For workflow, check out http://cascading.org -- that works quite well and fits what you described.

Greenplum and Aster Data have announced support for running MR within the context of their relational databases, e.g., http://www.greenplum.com/resources/mapreduce/

In terms of Pig, Hive, these RDBMS vendors, etc., they seem to be quite good for situations where there are lots of ad hoc queries, short-term business intelligence needs, or less-technical staff involved. However, if there are large, repeated batch jobs which require significant analytics work, then I'm not so convinced that SQL is the right mind-set for representing the math required for algorithms, or for maintaining complex code throughout the software lifecycle.

I run an analytics group where our statisticians use R, while our developers use Hadoop, Cascading, etc., at scale on terabytes. One approach is simply to sample data, analyze it in R, then use the analysis to articulate requirements for developers to use at scale.

In terms of running R on large data, one issue is that -- in contrast to SAS, where data is handled line-by-line -- R is limited by how much data can be loaded into memory. Another issue is that while some areas of statistical data analysis are suitable for MapReduce, others clearly are not. Mahout or similar projects may go far, but do not expect them to be capable of displacing R, SAS, etc.

For example, you can accomplish much by scanning a data set to determine N, sum X, sum X^2, etc., to produce descriptive stats, quantiles, C.I., plots for p.d.f., c.d.f., etc. Quite useful. However, MapReduce requires data independence, so it will not serve well for tasks such as inverting a matrix.

You might want to look into Parallel R, and talk with http://www.revolution-computing.com/

Our team has a project which runs Hadoop workflows underneath R. It is at an early stage, and there's no plan yet about a public release. It's not a simple thing to implement by any stretch of the imagination!

Best,
Paco

On Wed, Sep 24, 2008 at 4:39 AM, Arijit Mukherjee [EMAIL PROTECTED] wrote:
Thanx Enis. By workflow, I was trying to mean something like a chain of MapReduce jobs - the first one will extract a certain amount of data from the original set and do some computation resulting in a smaller summary, which will then be the input to a further MR job, and so on...somewhat similar to a workflow as in the SOA world.

Is it possible to use statistical analysis tools such as R (or say PL/R) within MapReduce on Hadoop? As far as I've heard, Greenplum is working on a custom MapReduce engine over their Greenplum database which will also support PL/R procedures.

Arijit

Dr. Arijit Mukherjee
Principal Member of Technical Staff, Level-II
Connectiva Systems (I) Pvt. Ltd.
J-2, Block GP, Sector V, Salt Lake
Kolkata 700 091, India
Phone: +91 (0)33 23577531/32 x 107
http://www.connectivasystems.com
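[That kind of scan fits a single streaming pass naturally. A sketch of the reduce side in Python, assuming mappers emit "key<TAB>x" lines - mean and variance then come straight from N, sum X, sum X^2:]

#!/usr/bin/env python
# reducer.py - accumulate N, sum(x), sum(x^2) per key; emit mean and
# (population) variance. Streaming delivers input sorted by key.
import sys

def flush(key, n, s, s2):
    if key is not None:
        mean = s / n
        var = s2 / n - mean * mean
        print "%s\t%d\t%f\t%f" % (key, n, mean, var)

last_key, n, s, s2 = None, 0, 0.0, 0.0
for line in sys.stdin:
    key, x = line.rstrip("\n").split("\t")
    x = float(x)
    if key != last_key:
        flush(last_key, n, s, s2)
        last_key, n, s, s2 = key, 0, 0.0, 0.0
    n += 1
    s += x
    s2 += x * x
flush(last_key, n, s, s2)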
Re: Questions about Hadoop
Certainly. It'd be great to talk with others working in analytics and statistical computing, who have been evaluating MapReduce as well.

Paco

On Wed, Sep 24, 2008 at 7:45 AM, Arijit Mukherjee [EMAIL PROTECTED] wrote:
That's a very good overview Paco - thanx for that. I might get back to you with more queries about Cascading etc. at some time - hope you wouldn't mind.

Regards
Arijit

Dr. Arijit Mukherjee
Principal Member of Technical Staff, Level-II
Connectiva Systems (I) Pvt. Ltd.
J-2, Block GP, Sector V, Salt Lake
Kolkata 700 091, India
Phone: +91 (0)33 23577531/32 x 107
http://www.connectivasystems.com
Re: How to manage a large cluster?
Thanks, Steve -

Another flexible approach to handling messages across firewalls, between the jobtracker and worker nodes, etc., would be to place an AMQP message broker on the jobtracker and another inside our local network. We're experimenting with RabbitMQ for that.

On Tue, Sep 16, 2008 at 4:03 AM, Steve Loughran [EMAIL PROTECTED] wrote:

> We use a set of Python scripts to manage a daily, (mostly) automated launch of 100+ EC2 nodes for a Hadoop cluster. We also run a listener on a local server, so that the Hadoop job can send notification when it completes, and allow the local server to initiate download of results. Overall, that minimizes the need for having a sysadmin dedicated to the Hadoop jobs -- a small dev team can handle it, while focusing on algorithm development and testing.

1. We have some components that use Google Talk to relay messages to local boxes behind the firewall. I could imagine hooking up Hadoop status events to that too.

2. There's an old paper of mine, "Making Web Services that Work", in which I talk about deployment-centric development: http://www.hpl.hp.com/techreports/2002/HPL-2002-274.html

The idea is that right from the outset, the dev team work on a cluster that resembles production, the CI server builds to it automatically, and changes get pushed out to production semi-automatically (you tag the version you want pushed out in SVN, the CI server does the release). The article is focused on services exported to third parties, not back-end stuff, so it may not all apply to Hadoop deployments.

-steve
Re: How to manage a large cluster?
We use an EC2 image onto which we install Java, Ant, Hadoop, etc. To make it simple, pull those from S3 buckets. That provides a flexible pattern for managing the frameworks involved -- more so than needing to re-do an EC2 image whenever you want to add a patch to Hadoop.

Given that approach, you can add your Hadoop application code similarly. Just upload the current stable build out of SVN, Git, whatever, to an S3 bucket.

We use a set of Python scripts to manage a daily, (mostly) automated launch of 100+ EC2 nodes for a Hadoop cluster. We also run a listener on a local server, so that the Hadoop job can send notification when it completes, and allow the local server to initiate download of results. Overall, that minimizes the need for having a sysadmin dedicated to the Hadoop jobs -- a small dev team can handle it, while focusing on algorithm development and testing.

> Or on EC2 and its competitors, just build a new image whenever you need to update Hadoop itself.
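[For illustration, the pull-from-S3 step at instance launch with boto - bucket and key names are placeholders:]

#!/usr/bin/env python
# At instance launch, pull pinned tool tarballs down from S3 rather than
# baking them into the AMI.
import boto

conn = boto.connect_s3()
bucket = conn.get_bucket("our-cluster-tools")   # hypothetical bucket

for name in ("java.tar.gz", "ant.tar.gz", "hadoop-0.18.1.tar.gz"):
    key = bucket.get_key(name)
    key.get_contents_to_filename("/tmp/" + name)
    # ...then untar into place, e.g. under /usr/local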
Re: Hadoop (0.18) Spill Failed, out of Heap Space error
Also, that almost always happens early in the map phase of the first MR job which runs on our cluster. Hadoop 0.18.1 on EC2 m1.xl instances. We run 10 MR jobs in sequence, 6 hr total duration, and we're not seeing the problem repeated after that one heap-space exception.

Paco

On Wed, Sep 3, 2008 at 11:42 AM, Florian Leibert [EMAIL PROTECTED] wrote:
Hi,
we're running 100 XLarge instances (EC2), with a gig of heap space for each task, and are seeing the following error frequently (but not always):

# BEGIN PASTE #
[exec] 08/09/03 11:21:09 INFO mapred.JobClient: map 43% reduce 5%
[exec] 08/09/03 11:21:16 INFO mapred.JobClient: Task Id : attempt_200809031101_0001_m_000220_0, Status : FAILED
[exec] java.io.IOException: Spill failed
[exec]     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:688)
[exec]     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:228)
[exec]     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
[exec] Caused by: java.lang.OutOfMemoryError: Java heap space
[exec]     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$InMemValBytes.reset(MapTask.java:928)
[exec]     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.getVBytesForOffset(MapTask.java:891)
[exec]     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:765)
[exec]     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1600(MapTask.java:286)
[exec]     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:712)
# END #

Has anyone seen this?

Thanks,
Florian Leibert
Sr. Software Engineer
Adknowledge Inc.
Re: question on fault tolerance
Just a guess: for a long-running sequence of MR jobs, how's the namenode behaving during that time? If it gets corrupted, one might see that behavior.

We have a similar situation, with 9 MR jobs back-to-back, taking much of the day. Might be good to add some notification to an external process after the end of each of those 3 MR jobs.

Paco

On Mon, Aug 11, 2008 at 12:34 PM, Mori Bellamy [EMAIL PROTECTED] wrote:
Hey all,
I have a job consisting of three MR jobs back to back to back. Each job takes an appreciable percent of a day to complete (30% to 70%). Even though I execute these jobs in a blocking fashion:
Re: iterative map-reduce
A simple example of Hadoop application code which follows that pattern (iterate until condition) is in the jyte section here: http://code.google.com/p/ceteri-mapred/

The loop and condition test are in the same code which calls ToolRunner and JobClient.

Best,
Paco

On Tue, Jul 29, 2008 at 10:03 AM, Christian Ulrik Søttrup [EMAIL PROTECTED] wrote:
Hi Shirley,
I am basically doing as Qin suggested. I am running a job iteratively until some condition is met. My main looks something like this (in pseudo code):

main:
  while (!converged):
    make new jobconf
    setup jobconf
    run jobconf
    check reporter for statistics
    decide if converged

I use a custom reporter to check on the fitness of the solution in the reduce phase. If you need more (real Java) code, drop me a line.

Cheers,
Christian
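[The same iterate-until-converged loop, sketched as a runnable driver in Python - the jar, class name, and the convergence counter scraped from the job output are all placeholders:]

#!/usr/bin/env python
# Re-run a Hadoop job until a convergence condition is met, mirroring the
# pseudocode above. Each pass consumes the previous pass's output.
import subprocess

converged = False
iteration = 0
while not converged:
    inp = "data/iter%d" % iteration
    out = "data/iter%d" % (iteration + 1)
    proc = subprocess.Popen(
        ["hadoop", "jar", "solver.jar", "Solve", inp, out],
        stderr=subprocess.PIPE)
    _, log = proc.communicate()
    if proc.returncode != 0:
        raise SystemExit("iteration %d failed" % iteration)
    # e.g. the job logs a custom counter line: "FITNESS_DELTA=0.0004"
    delta = float(log.split("FITNESS_DELTA=")[1].split()[0])
    converged = delta < 1e-3
    iteration += 1

print "converged after %d iterations" % iteration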
Re: JobTracker History data+analysis
Thank you, Amareshwari - That helps.

Hadn't noticed HistoryViewer before. It has no JavaDoc. What is a typical usage? In other words, what would be the outputDir value in the context of ToolRunner, JobClient, etc.?

Paco

On Sun, Jul 27, 2008 at 11:48 PM, Amareshwari Sriramadasu [EMAIL PROTECTED] wrote:
Can you have a look at org.apache.hadoop.mapred.HistoryViewer and see if it makes sense?
Thanks
Amareshwari

Paco NATHAN wrote:
We have a need to access data found in the JobTracker History link, specifically in the "Analyse This Job" analysis. Must be run in Java, between jobs, in the same code which calls ToolRunner and JobClient. In essence, we need to collect descriptive statistics about task counts and times for map, shuffle, reduce.

After tracing the flow of the JSP in src/webapps/job... Is there a better way to get at this data, *not* from the web UI perspective but from the code? Tried to find any applicable patterns in JobTracker, ClusterStatus, JobClient, etc., but no joy.

Thanks,
Paco
Re: JobTracker History data+analysis
Thanks, Amareshwari -

That could be quite useful, to access summary analysis from within the code. Currently this is not written as a public class, which makes it difficult to use inside application code. Are there plans to make it a public class?

Paco

On Mon, Jul 28, 2008 at 1:42 AM, Amareshwari Sriramadasu [EMAIL PROTECTED] wrote:
HistoryViewer is used in JobClient to view the history files in the directory provided on the command line. The command is:

  $ bin/hadoop job -history history-dir    # by default, history is stored in the output dir

outputDir in the constructor of HistoryViewer is the directory passed on the command line. You can specify a location to store the history files of a particular job using hadoop.job.history.user.location. If nothing is specified, the logs are stored in the job's output directory, i.e. mapred.output.dir. The files are stored in _logs/history/ inside that directory.

Thanks
Amareshwari
JobTracker History data+analysis
We have a need to access data found in the JobTracker History link, specifically in the "Analyse This Job" analysis. Must be run in Java, between jobs, in the same code which calls ToolRunner and JobClient. In essence, we need to collect descriptive statistics about task counts and times for map, shuffle, reduce.

After tracing the flow of the JSP in src/webapps/job... Is there a better way to get at this data, *not* from the web UI perspective but from the code? Tried to find any applicable patterns in JobTracker, ClusterStatus, JobClient, etc., but no joy.

Thanks,
Paco
Re: Using MapReduce to do table comparing.
This is merely an in-the-ballpark calculation, regarding that 10-minute / 4-node requirement...

We have a reasonably similar Hadoop job (slightly more complex in the reduce phase) running on AWS with:

* 100+2 nodes (m1.xl config)
* approx 3x the number of rows and data size
* completes in 6 minutes

You have faster processors. So it might require more on the order of 25-35 nodes for 10-minute completion. That's a very rough estimate.

Those other two steps (deletes, inserts) might be performed in the same pass as the compares -- and potentially quicker overall, when you consider the time to load and recreate the table in the RDBMS.

Paco

On Wed, Jul 23, 2008 at 9:33 AM, Amber [EMAIL PROTECTED] wrote:
We have a 10 million row table exported from an AS400 mainframe every day. The table is exported as a CSV text file, which is about 30GB in size, and then imported into an RDBMS table which is dropped and recreated every day. Now we want to find how many rows are updated during each export-import interval. The table has a primary key, so deletes and inserts can be found quickly using RDBMS joins, but we must do a column-to-column comparison in order to find the differences between rows (about 90%) with the same primary keys. Our goal is a comparing process which takes no more than 10 minutes on a 4-node cluster, where each server has four 4-core 3.0 GHz CPUs, 8GB memory, and a 300GB local RAID5 array.
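[A sketch of doing the compares - and catching deletes and inserts in the same pass - as one streaming job over both days' exports. Here mappers are assumed to emit "pk<TAB>tag<TAB>row", with the tag ("old" or "new") taken from the input file name; the CSV layout is a placeholder:]

#!/usr/bin/env python
# reducer.py - compare yesterday's and today's exports in one pass.
# Streaming groups input by primary key, so each group holds at most one
# "old" and one "new" row.
import sys
from itertools import groupby

def records(stdin):
    for line in stdin:
        pk, tag, row = line.rstrip("\n").split("\t", 2)
        yield pk, tag, row

for pk, group in groupby(records(sys.stdin), key=lambda r: r[0]):
    rows = dict((tag, row) for _, tag, row in group)
    if "old" not in rows:
        print "INSERT\t%s" % pk
    elif "new" not in rows:
        print "DELETE\t%s" % pk
    elif rows["old"] != rows["new"]:
        print "UPDATE\t%s" % pk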
Re: Large Weblink Graph
Another site which has data sets available for study is the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/

On Tue, Apr 15, 2008 at 8:29 AM, Chaman Singh Verma [EMAIL PROTECTED] wrote:
Does anyone have a large weblink graph? I want to experiment with and benchmark MapReduce on some real dataset.
Re: walkthrough of developing first hadoop app from scratch
Hi Stephen,

Here's a sample Hadoop app which has its build based on Ant: http://code.google.com/p/ceteri-mapred/

Look in the jyte directory. A target called prep.jar simply uses the <jar/> task in Ant to build a JAR for Hadoop to use.

Yeah, I agree that docs and discussions seem to lean more toward systems engineering and are sparse about writing applications. A few folks are trying to change that :)

On Fri, Mar 21, 2008 at 6:35 PM, Stephen J. Barr [EMAIL PROTECTED] wrote:
Hello,

I am working on developing my first Hadoop app from scratch. It is a Monte Carlo simulation, and I am using the PiEstimator code from the examples as a reference. I believe I have what I want in a .java file. However, I couldn't find any documentation on how to make that .java file into a .jar that I could run, and I haven't found much documentation that is Hadoop-specific. Is it basically:

javac MyApp.java
jar -cf MyApp

or something to that effect, or is there more to it? Thanks! Sorry for the newbie question.

-stephen barr
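[For reference, a minimal build.xml along the lines of that prep.jar target - the source layout and the Hadoop core jar path are placeholders to adjust for your checkout:]

<project name="myapp" default="prep.jar">
  <!-- compile the app against the Hadoop core jar (path is a placeholder) -->
  <target name="compile">
    <mkdir dir="build/classes"/>
    <javac srcdir="src" destdir="build/classes" classpath="lib/hadoop-core.jar"/>
  </target>
  <!-- package compiled classes into a jar that "bin/hadoop jar" can run -->
  <target name="prep.jar" depends="compile">
    <jar jarfile="build/myapp.jar" basedir="build/classes"/>
  </target>
</project>

[Then "ant prep.jar" builds the JAR, and something like "bin/hadoop jar build/myapp.jar MyApp <args>" runs it on the cluster.]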
Re: Add your project or company to the powered by page?
More on the subject of outreach -- not specific uses at companies, but a couple of things might help get the word out:

- Add a community group on LinkedIn (shows up on profile searches): http://www.linkedin.com/static?key=groups_faq
- Add a link on the wiki to the Facebook group about Hadoop: http://www.facebook.com/pages/Hadoop/9887781514

There's also a small but growing network of local user groups for Amazon AWS, and much interest there in presentations and discussions about Hadoop: http://www.amazon.com/Upcoming-Events-AWS-home-page/b/ref=sc_fe_c_0_371080011_1/103-5668663-1566203?ie=UTF8node=16284451no=371080011me=A36L942TSJ2AJA

I'd be happy to help with any of those.

Paco

On Wed, Feb 20, 2008 at 10:26 PM, Eric Baldeschwieler [EMAIL PROTECTED] wrote:
Hi Folks,

Let's get the word out that Hadoop is being used and is useful in your organizations, ok? Please add yourselves to the Hadoop "powered by" page, or reply to this email with what details you would like to add and I'll do it: http://wiki.apache.org/hadoop/PoweredBy

Thanks!

E14

---
eric14 a.k.a. Eric Baldeschwieler
senior director, grid computing
Yahoo! Inc.