RE: Hadoop UI beta
Stefan,

Thanks for contributing this, it's very nice. We may try to use the Hadoop UI web server component as an XML data source feeding a web app that shows users the state of their jobs, since it looks like a simple web server to customize for pulling job info over to another server or via AJAX.

Thanks!
Josh Patterson
TVA

-----Original Message-----
From: Stefan Podkowinski [mailto:spo...@gmail.com]
Sent: Tuesday, March 31, 2009 7:12 AM
To: core-user@hadoop.apache.org
Subject: ANN: Hadoop UI beta

Hello,

I'd like to invite you to take a look at the recently released first beta of Hadoop UI, a graphical Flex/Java based client for Hadoop Core. Hadoop UI currently includes an HDFS file explorer and basic job tracking features.

Get it here: http://code.google.com/p/hadoop-ui/

As this is the first release it may (and does) still contain bugs, but I'd like to give everyone the chance to send feedback as early as possible. Give it a try :)

- Stefan
RE: Hadoop and Matlab
Sameer,

I'd be interested in that as well. We are constructing a Hadoop cluster for energy (PMU) data for NERC, and we will potentially be running jobs for a number of groups and researchers. I know some researchers will know nothing of MapReduce yet are very keen on Matlab, so we're looking at ways to make that transition as smooth as possible.

Josh Patterson
TVA

-----Original Message-----
From: Sameer Tilak [mailto:sameer.u...@gmail.com]
Sent: Tuesday, April 21, 2009 1:56 PM
To: core-user@hadoop.apache.org
Subject: Hadoop and Matlab

Hi there,

We're working on an image analysis project. The image processing code is written in Matlab. If I invoke that code from a shell script and then use that shell script within Hadoop streaming, will that work? Has anyone done something along these lines?

Many thanks,
--ST.
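For what it's worth, streaming runs any executable that reads records on stdin and writes key/value lines on stdout, so a shell script wrapping Matlab should work as long as Matlab (or the compiled Matlab runtime) is installed on every node. The same contract can also be driven from a plain Java mapper. Below is a minimal, untested sketch of that alternative against the 0.19-era API; the wrapper path process_image.sh is hypothetical, and a real job would keep one long-running child process per task rather than forking per record.

// A minimal sketch, not a tested implementation. Assumes a hypothetical
// wrapper script process_image.sh that reads one record on stdin and
// writes one result line on stdout -- the same contract streaming uses.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ShellOutMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Launch the external (e.g. Matlab-wrapping) script for this record.
    Process p = new ProcessBuilder("/usr/local/bin/process_image.sh").start();

    // Feed the input record to the script's stdin.
    Writer stdin = new OutputStreamWriter(p.getOutputStream());
    stdin.write(value.toString());
    stdin.write('\n');
    stdin.close();

    // Collect the script's stdout line as the map output value.
    // (A real implementation would also check the exit code and drain stderr.)
    BufferedReader stdout =
        new BufferedReader(new InputStreamReader(p.getInputStream()));
    String result = stdout.readLine();
    stdout.close();
    if (result != null) {
      output.collect(value, new Text(result));
    }
  }
}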
Map Rendering
We're looking into power grid visualization and were wondering if anyone could recommend a good native Java library (one that plays nicely with Hadoop) to render layers of geospatial data.

At this point we have the cluster crunching our test data, formats, and data structures, and we're now looking at producing indexes and visualizations. We'd like to be able to watch the power grid over time (with a "time slider") over the map and load the tiles in OpenLayers, OpenStreetMap, or VirtualEarth, so the engineers could go back and replay large amounts of high-resolution PMU smart grid data, zoom in and out, and use the time slider to replay it. So essentially we'll need to render the grid graph as a layer in tiles, and then render each tile (at each zoom level) through time.

I'm hoping someone has done some work with Hadoop and map tile generation and can save me some time in finding the right Java library. Suggestions?

Josh Patterson
TVA
RE: Small Test Data Sets
You are exactly right; there was a secondary constructor in my Reader class that was not setting its split start and length correctly, so each one was just reading the whole file. I missed a silly one, thanks for the heads up!

Josh Patterson
TVA

-----Original Message-----
From: Enis Soztutar [mailto:enis@gmail.com]
Sent: Wednesday, March 25, 2009 5:27 AM
To: core-user@hadoop.apache.org
Subject: Re: Small Test Data Sets

Patterson, Josh wrote:

I want to confirm something with the list that I'm seeing. I needed to confirm that my Reader was reading our file format correctly, so I created an MR job that simply outputs each K/V pair to the reducer, which then just writes each one to the output file. This lets me check by hand that all K/V points of data are getting pulled out of our file format correctly. I have set up InputFormat, RecordReader, and Reader subclasses for our specific file format.

While running some basic tests on a small (1 MB) single file I noticed something odd: I was getting two copies of each data point in the output file. Initially I thought my Reader was somehow reading a data point without moving the read head, but I verified through a series of tests that this was not the case. I then reasoned that since my job had two mappers by default, and only one input file, each mapper must be reading the file independently. I then set the -m flag to 1, and I got the proper output.

Is it safe to assume, when testing on a file smaller than the block size, that I should always use -m 1 in order to get a proper block-to-mapper mapping? Also, should I assume that if you have more mappers than disk blocks involved you will get duplicate values? I may have set something wrong, I just wanted to check.

Thanks,
Josh Patterson
TVA

If you have developed your own InputFormat, then the problem might be there. The job of the InputFormat is to create input splits and readers. For one file and two mappers, the InputFormat should return two splits, each representing half of the file. In your case, I assume you return two splits that each contain the whole file. Is this the case?

Enis
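The failure mode Enis diagnoses is the classic one for custom readers: a reader that ignores its FileSplit's start and length makes every mapper scan from byte 0 to EOF, so the output is duplicated once per mapper. Below is a minimal sketch, not Josh's actual code, of a 0.19-style reader for fixed 10-byte records that honors its split boundaries; the class name and record layout are hypothetical.

// A minimal sketch of a split-respecting reader for fixed-width records.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.RecordReader;

public class FixedPointRecordReader
    implements RecordReader<LongWritable, BytesWritable> {

  private static final int RECORD_LEN = 10; // one data point

  private final FSDataInputStream in;
  private final long start; // first byte this reader owns
  private final long end;   // first byte past this split
  private long pos;

  public FixedPointRecordReader(Configuration conf, FileSplit split)
      throws IOException {
    // Round the split start UP to a record boundary: a record straddling
    // the boundary belongs to the previous split, which reads past its end.
    start = ((split.getStart() + RECORD_LEN - 1) / RECORD_LEN) * RECORD_LEN;
    end = split.getStart() + split.getLength();
    FileSystem fs = split.getPath().getFileSystem(conf);
    in = fs.open(split.getPath());
    in.seek(start); // the step the buggy secondary constructor skipped
    pos = start;
  }

  public boolean next(LongWritable key, BytesWritable value)
      throws IOException {
    if (pos >= end) {
      return false; // past our split: that data belongs to the next mapper
    }
    byte[] buf = new byte[RECORD_LEN];
    in.readFully(pos, buf);
    key.set(pos);
    value.set(buf, 0, RECORD_LEN);
    pos += RECORD_LEN;
    return true;
  }

  public LongWritable createKey() { return new LongWritable(); }
  public BytesWritable createValue() { return new BytesWritable(); }
  public long getPos() { return pos; }
  public void close() throws IOException { in.close(); }
  public float getProgress() {
    return end == start ? 1.0f : (pos - start) / (float) (end - start);
  }
}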
Small Test Data Sets
I want to confirm something with the list that I'm seeing. I needed to confirm that my Reader was reading our file format correctly, so I created an MR job that simply outputs each K/V pair to the reducer, which then just writes each one to the output file. This lets me check by hand that all K/V points of data are getting pulled out of our file format correctly. I have set up InputFormat, RecordReader, and Reader subclasses for our specific file format.

While running some basic tests on a small (1 MB) single file I noticed something odd: I was getting two copies of each data point in the output file. Initially I thought my Reader was somehow reading a data point without moving the read head, but I verified through a series of tests that this was not the case. I then reasoned that since my job had two mappers by default, and only one input file, each mapper must be reading the file independently. I then set the -m flag to 1, and I got the proper output.

Is it safe to assume, when testing on a file smaller than the block size, that I should always use -m 1 in order to get a proper block-to-mapper mapping? Also, should I assume that if you have more mappers than disk blocks involved you will get duplicate values? I may have set something wrong, I just wanted to check.

Thanks,
Josh Patterson
TVA
RE: RecordReader design heuristic
Jeff,

Yeah, the mapper sitting on a dfs block is pretty cool. Also, yes, we are about to start crunching on a lot of energy smart grid data. TVA is sorta like Switzerland for smart grid power generation and transmission data across the nation. Right now we have about 12 TB, and this is slated to be around 30 TB by the end of 2010 (possibly more, depending on how many more PMUs come online).

I am very interested in Mahout and have read up on it; it has many algorithms that I am familiar with from grad school. I will be doing some very simple MR jobs early on, like finding the average frequency for a range of data, and I've been selling various groups internally on what CAN be done with good data mining and tools like Hadoop/Mahout. Our production cluster won't be online for a few more weeks, but that part is already rolling, so I've moved on to focus on designing the first jobs to find quality results/benefits that I can sell in order to campaign for the more ambitious projects I have drawn up.

I know time series data lends itself to many machine learning applications, so, yes, I would be very interested in talking to anyone who wants to talk or share notes on Hadoop and machine learning. I believe Mahout can be a tremendous resource for us and I definitely plan on running and contributing to it.

Josh Patterson
TVA

-----Original Message-----
From: Jeff Eastman [mailto:j...@windwardsolutions.com]
Sent: Wednesday, March 18, 2009 12:02 PM
To: core-user@hadoop.apache.org
Subject: Re: RecordReader design heuristic

Hi Josh,

It seemed like you had a conceptual wire crossed and I'm glad to help out. The neat thing about Hadoop mappers is that, since they are given a replicated HDFS block to munch on, the job scheduler has a replication-factor number of node choices where it can run each mapper. This means mappers are always reading from local storage.

On another note, I notice you are processing what looks to be large quantities of vector data. If you have any interest in clustering this data you might want to look at the Mahout project (http://lucene.apache.org/mahout/). We have a number of Hadoop-ready clustering algorithms, including a new non-parametric Dirichlet Process Clustering implementation that I committed recently. We are pulling it all together for a 0.1 release and I would be very interested in helping you to apply these algorithms if you have an interest.

Jeff

Patterson, Josh wrote:

Jeff, ok, that makes more sense. I was under the mistaken impression that it was creating and destroying mappers for each input record; I don't know why I had that in my head. My design suddenly became a lot clearer, and this provides a much cleaner abstraction. Thanks for your help!

Josh Patterson
TVA
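A sketch of the kind of "average frequency for a range of data" job Josh mentions, against the 0.19-era API. This is not TVA's code: the one-reading-per-line text layout (pmuId, tab, frequency) is a hypothetical stand-in for their binary format.

// A minimal sketch of an averaging job. Assumes a hypothetical text
// layout of one reading per line: "pmuId <tab> frequency".
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class AverageFrequency {

  // Emit (pmuId, frequency) for every reading.
  public static class AvgMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, DoubleWritable> {
    public void map(LongWritable key, Text line,
                    OutputCollector<Text, DoubleWritable> out, Reporter r)
        throws IOException {
      String[] fields = line.toString().split("\t");
      out.collect(new Text(fields[0]),
                  new DoubleWritable(Double.parseDouble(fields[1])));
    }
  }

  // Average all readings seen for each pmuId.
  public static class AvgReducer extends MapReduceBase
      implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    public void reduce(Text pmuId, Iterator<DoubleWritable> values,
                       OutputCollector<Text, DoubleWritable> out, Reporter r)
        throws IOException {
      double sum = 0;
      long count = 0;
      while (values.hasNext()) {
        sum += values.next().get();
        count++;
      }
      out.collect(pmuId, new DoubleWritable(sum / count));
    }
  }
}

Note the reducer can't be reused verbatim as a combiner, since an average of partial averages isn't the overall average; a combiner would have to emit (sum, count) pairs instead.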
RE: RecordReader design heuristic
Hi Tom,

Yeah, I'm assuming the splits are going to be about a single dfs block in size (64 MB here). Each file I'm working with is around 1.5 GB in size and has a sort of File Allocation Table at the very end which tells you the block sizes inside the file, plus some other info. Once I pull that info out of the tail end of the file, I can calculate which internal blocks lie inside the split's byte range, pull those out, and push the individual data points up to the mapper, as well as deal with any block that falls over the split boundary (for now I'm assuming I'll use the same idea as the line-oriented reader, and just read any block that falls over the end point of the split, unless it's the first split section). I guess the only hit I'm going to take here is having to ask the dfs for a quick read of the last 16 bytes of the whole file, where my file info is stored. Splitting this file format doesn't seem to be so bad; it's just a matter of finding which multiples of the internal block size fall inside the split range, and getting that multiple factor beforehand.

After I get the mechanics of the process down, and I show the team some valid results, I may be able to talk them into going to another format that works better with MR. If anyone has any ideas on what file formats work best for storing and processing large amounts of time series points with MR, I'm all ears. We're moving towards a new philosophy wrt big data, so it's a good time for us to examine best practices going forward.

Josh Patterson
TVA

-----Original Message-----
From: Tom White [mailto:t...@cloudera.com]
Sent: Wednesday, March 18, 2009 1:21 PM
To: core-user@hadoop.apache.org
Subject: Re: RecordReader design heuristic

Hi Josh,

The other aspect to think about when writing your own record reader is input splits. As Jeff mentioned, you really want mappers to be processing about one HDFS block's worth of data. If your inputs are significantly smaller, the overhead of creating mappers will be high and your jobs will be inefficient. On the other hand, if your inputs are significantly larger, then you need to split them, otherwise each mapper will take a very long time processing each split.

Some file formats are inherently splittable, meaning you can re-align with record boundaries from an arbitrary point in the file. Examples include line-oriented text (split at newlines) and bzip2 (has a unique block marker). If your format is splittable then you will be able to take advantage of this to make MR processing more efficient.

Cheers,
Tom

On Wed, Mar 18, 2009 at 5:00 PM, Patterson, Josh jpatters...@tva.gov wrote:

Jeff, Yeah, the mapper sitting on a dfs block is pretty cool. Also, yes, we are about to start crunching on a lot of energy smart grid data. TVA is sorta like Switzerland for smart grid power generation and transmission data across the nation. Right now we have about 12 TB, and this is slated to be around 30 TB by the end of 2010 (possibly more, depending on how many more PMUs come online). I am very interested in Mahout and have read up on it; it has many algorithms that I am familiar with from grad school. I will be doing some very simple MR jobs early on, like finding the average frequency for a range of data, and I've been selling various groups internally on what CAN be done with good data mining and tools like Hadoop/Mahout.

Our production cluster won't be online for a few more weeks, but that part is already rolling, so I've moved on to focus on designing the first jobs to find quality results/benefits that I can sell in order to campaign for the more ambitious projects I have drawn up. I know time series data lends itself to many machine learning applications, so, yes, I would be very interested in talking to anyone who wants to talk or share notes on Hadoop and machine learning. I believe Mahout can be a tremendous resource for us and I definitely plan on running and contributing to it.

Josh Patterson
TVA

-----Original Message-----
From: Jeff Eastman [mailto:j...@windwardsolutions.com]
Sent: Wednesday, March 18, 2009 12:02 PM
To: core-user@hadoop.apache.org
Subject: Re: RecordReader design heuristic

Hi Josh,

It seemed like you had a conceptual wire crossed and I'm glad to help out. The neat thing about Hadoop mappers is that, since they are given a replicated HDFS block to munch on, the job scheduler has a replication-factor number of node choices where it can run each mapper. This means mappers are always reading from local storage.

On another note, I notice you are processing what looks to be large quantities of vector data. If you have any interest in clustering this data you might want to look at the Mahout project (http://lucene.apache.org/mahout/). We have a number of Hadoop-ready clustering algorithms, including a new non-parametric Dirichlet Process Clustering implementation that I committed recently. We are pulling it all together for a 0.1 release and I would be very interested in helping you to apply these algorithms if you have an interest.
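A sketch of the alignment arithmetic Josh describes, with hypothetical numbers; it assumes the trailer has already been read and yielded one fixed internal block size. Each reader owns exactly the internal blocks that start inside its byte range, and reads the last of them past the split end if necessary, so no block is processed twice.

// A minimal sketch of split-to-internal-block alignment, not the real
// format code. Assumes the trailer yielded a fixed internal block size.
public class BlockAlignment {

  /** Index of the first internal block whose start lies in [splitStart, splitEnd). */
  static long firstBlock(long splitStart, long blockSize) {
    return (splitStart + blockSize - 1) / blockSize; // round up
  }

  /** Index one past the last internal block owned by this split. */
  static long endBlock(long splitEnd, long blockSize) {
    return (splitEnd + blockSize - 1) / blockSize;
  }

  public static void main(String[] args) {
    long blockSize = 8190;      // e.g. 819 points at 10 bytes each
    long splitStart = 67108864; // a 64 MB dfs split boundary
    long splitEnd = 134217728;
    long first = firstBlock(splitStart, blockSize);
    long end = endBlock(splitEnd, blockSize);
    // The reader seeks to first * blockSize and reads blocks [first, end);
    // the last block may run past splitEnd, which is fine -- the next
    // split skips it because its start lies before that split's range.
    System.out.println("blocks " + first + " .. " + (end - 1));
  }
}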
RecordReader design heuristic
I am currently working on a RecordReader to read a custom time series data binary file format, and I was wondering about ways to be most efficient in designing the InputFormat/RecordReader process. Reading through http://wiki.apache.org/hadoop/HadoopMapReduce gave me a lot of hints about how the various classes work together to read any type of file. I was looking at how TextInputFormat uses LineRecordReader to send individual lines to each mapper.

My question is: what is a good heuristic for choosing how much data to send to each mapper? With the stock LineRecordReader, each mapper only gets to work with a single line, which leads me to believe that we want to give each mapper very little work. Currently I'm looking at either sending each mapper a single point of data (10 bytes), which seems small, or sending a single mapper a block of data (around 819 points at 10 bytes each, or 8,190 bytes). I'm leaning towards sending the block to the mapper. These factors are based around dealing with a legacy file format (for now), so I'm just trying to make the best tradeoff possible for the short term until I get some basic stuff rolling, at which point I can suggest a better storage format, or just start converting the groups of stored points into a format more fitting for the platform.

I understand that the InputFormat is not really trying to make much meaning out of the data, other than to assist in getting the correct data out of the file based on the file split variables. Another question I have is: with a pretty much stock install, generally how big is each FileSplit?

Josh Patterson
TVA
RE: RecordReader design heuristic
Jeff,

So if I'm hearing you right, it's good to send one point of data (10 bytes here) to a single mapper? This mindset increases the number of mappers, but keeps their logic scaled down to simply "look at this record and emit/don't emit" --- which is considered more favorable? I'm still getting the hang of the MR design tradeoffs; thanks for your feedback.

Josh Patterson
TVA

-----Original Message-----
From: Jeff Eastman [mailto:j...@windwardsolutions.com]
Sent: Tuesday, March 17, 2009 5:11 PM
To: core-user@hadoop.apache.org
Subject: Re: RecordReader design heuristic

If you send a single point to the mapper, your mapper logic will be clean and simple. Otherwise you will need to loop over your block of points in the mapper. In Mahout clustering, I send the mapper individual points because the input file is point-per-line. In either case, the record reader will be iterating over a block of data to provide mapper inputs. IIRC, splits will generally be an HDFS block or less, so if you have files smaller than that you will get one mapper per file. For larger files you can get up to one mapper per split block.

Jeff

Patterson, Josh wrote:

I am currently working on a RecordReader to read a custom time series data binary file format, and I was wondering about ways to be most efficient in designing the InputFormat/RecordReader process. Reading through http://wiki.apache.org/hadoop/HadoopMapReduce gave me a lot of hints about how the various classes work together to read any type of file. I was looking at how TextInputFormat uses LineRecordReader to send individual lines to each mapper.

My question is: what is a good heuristic for choosing how much data to send to each mapper? With the stock LineRecordReader, each mapper only gets to work with a single line, which leads me to believe that we want to give each mapper very little work. Currently I'm looking at either sending each mapper a single point of data (10 bytes), which seems small, or sending a single mapper a block of data (around 819 points at 10 bytes each, or 8,190 bytes). I'm leaning towards sending the block to the mapper. These factors are based around dealing with a legacy file format (for now), so I'm just trying to make the best tradeoff possible for the short term until I get some basic stuff rolling, at which point I can suggest a better storage format, or just start converting the groups of stored points into a format more fitting for the platform.

I understand that the InputFormat is not really trying to make much meaning out of the data, other than to assist in getting the correct data out of the file based on the file split variables. Another question I have is: with a pretty much stock install, generally how big is each FileSplit?

Josh Patterson
TVA
RE: RecordReader design heuristic
Jeff,

OK, that makes more sense. I was under the mistaken impression that it was creating and destroying mappers for each input record; I don't know why I had that in my head. My design suddenly became a lot clearer, and this provides a much cleaner abstraction. Thanks for your help!

Josh Patterson
TVA

-----Original Message-----
From: Jeff Eastman [mailto:j...@windwardsolutions.com]
Sent: Tue 03/17/2009 6:02 PM
To: core-user@hadoop.apache.org
Subject: Re: RecordReader design heuristic

Hi Josh,

Well, I don't really see how you will get more mappers, just simpler logic in the mapper. The number of mappers is driven by how many input files you have and their sizes, not by any chunking you do in the record reader. Each record reader will get an entire split and will feed it to its mapper in a stream, one record at a time. You can duplicate some of that logic in the mapper if you want, but you already will have it in the reader, so why bother?

Jeff

Patterson, Josh wrote:

Jeff, So if I'm hearing you right, it's good to send one point of data (10 bytes here) to a single mapper? This mindset increases the number of mappers, but keeps their logic scaled down to simply "look at this record and emit/don't emit" --- which is considered more favorable? I'm still getting the hang of the MR design tradeoffs; thanks for your feedback.

Josh Patterson
TVA

-----Original Message-----
From: Jeff Eastman [mailto:j...@windwardsolutions.com]
Sent: Tuesday, March 17, 2009 5:11 PM
To: core-user@hadoop.apache.org
Subject: Re: RecordReader design heuristic

If you send a single point to the mapper, your mapper logic will be clean and simple. Otherwise you will need to loop over your block of points in the mapper. In Mahout clustering, I send the mapper individual points because the input file is point-per-line. In either case, the record reader will be iterating over a block of data to provide mapper inputs. IIRC, splits will generally be an HDFS block or less, so if you have files smaller than that you will get one mapper per file. For larger files you can get up to one mapper per split block.

Jeff

Patterson, Josh wrote:

I am currently working on a RecordReader to read a custom time series data binary file format, and I was wondering about ways to be most efficient in designing the InputFormat/RecordReader process. Reading through http://wiki.apache.org/hadoop/HadoopMapReduce gave me a lot of hints about how the various classes work together to read any type of file. I was looking at how TextInputFormat uses LineRecordReader to send individual lines to each mapper.

My question is: what is a good heuristic for choosing how much data to send to each mapper? With the stock LineRecordReader, each mapper only gets to work with a single line, which leads me to believe that we want to give each mapper very little work. Currently I'm looking at either sending each mapper a single point of data (10 bytes), which seems small, or sending a single mapper a block of data (around 819 points at 10 bytes each, or 8,190 bytes). I'm leaning towards sending the block to the mapper. These factors are based around dealing with a legacy file format (for now), so I'm just trying to make the best tradeoff possible for the short term until I get some basic stuff rolling, at which point I can suggest a better storage format, or just start converting the groups of stored points into a format more fitting for the platform.

I understand that the InputFormat is not really trying to make much meaning out of the data, other than to assist in getting the correct data out of the file based on the file split variables. Another question I have is: with a pretty much stock install, generally how big is each FileSplit?

Josh Patterson
TVA
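As a coda to the thread: the payoff of record-at-a-time feeding is how small the mapper gets. A sketch of the "look at this record and emit/don't emit" shape Josh describes; the (timestamp, frequency) record types and the out-of-band filter are hypothetical, not TVA logic.

// A minimal sketch of the "emit / don't emit" mapper shape. Assumes the
// record reader delivers one decoded point per call; the hypothetical
// filter keeps only out-of-band frequency readings.
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class OutOfBandFilterMapper extends MapReduceBase
    implements Mapper<LongWritable, DoubleWritable, LongWritable, DoubleWritable> {

  private static final double NOMINAL_HZ = 60.0;
  private static final double TOLERANCE_HZ = 0.05;

  public void map(LongWritable timestamp, DoubleWritable frequency,
                  OutputCollector<LongWritable, DoubleWritable> out,
                  Reporter reporter) throws IOException {
    // One record in, zero or one record out -- no looping over a block
    // of points; the record reader already did the iteration.
    if (Math.abs(frequency.get() - NOMINAL_HZ) > TOLERANCE_HZ) {
      out.collect(timestamp, frequency);
    }
  }
}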
RE: Issues installing FUSE_DFS
Brian,

Do you know of anyone using Samba to access the FUSE-DFS mount point via Windows? We have FUSE-DFS working, but read/write doesn't work via Samba.

Josh Patterson

-----Original Message-----
From: Brian Bockelman [mailto:bbock...@cse.unl.edu]
Sent: Tuesday, March 03, 2009 11:26 AM
To: core-user@hadoop.apache.org
Subject: Re: Issues installing FUSE_DFS

On Mar 3, 2009, at 10:01 AM, Patterson, Josh wrote:

Hey Brian,

I'm working with Matthew on our hdfs install; he's doing the server admin on this project. We just tried the settings you suggested, and we got the following error:

[r...@socdvmhdfs1 ~]# fuse_dfs -oserver=socdvmhdfs1 -oport=9000 /hdfs -oallow_other -ordbuffer=131072
fuse-dfs didn't recognize /hdfs,-2
fuse-dfs ignoring option allow_other
[r...@socdvmhdfs1 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                      5.8G  3.7G  1.9G  67% /
/dev/sda1              99M   18M   77M  19% /boot
tmpfs                 506M     0  506M   0% /dev/shm
df: `/hdfs': Input/output error

So, you have a successful mount, but for some reason it's not talking to Hadoop. Try adding the -d flag; this will mount FUSE in debug mode, and any Hadoop errors (or whatever is causing the Input/output error) will be printed to the terminal.

Brian

We are using Redhat EL5 and Hadoop 0.19. We did have some trouble compiling FUSE-DFS but got through the compilation errors. Any advice on what to try next?

Josh Patterson
TVA

-----Original Message-----
From: Brian Bockelman [mailto:bbock...@cse.unl.edu]
Sent: Monday, March 02, 2009 5:30 PM
To: core-user@hadoop.apache.org
Subject: Re: Issues installing FUSE_DFS

Hey Matthew,

We use the following command on 0.19.0:

fuse_dfs -oserver=hadoop-name -oport=9000 /mnt/hadoop -oallow_other -ordbuffer=131072

Brian

On Mar 2, 2009, at 4:12 PM, Hyatt, Matthew G wrote:

When we try to mount the dfs from FUSE we are getting the following errors. Has anyone seen this issue in the past? This is on version 0.19.0.

[r...@socdvmhdfs1]# fuse_dfs dfs://socdvmhdfs1:9000 /hdfs
port=9000,server=socdvmhdfs1
fuse-dfs didn't recognize /hdfs,-2
[r...@socdvmhdfs1]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                      5.8G  3.7G  1.9G  67% /
/dev/sda1              99M   18M   77M  19% /boot
tmpfs                 506M     0  506M   0% /dev/shm
df: `/hdfs': Input/output error

Matthew G. Hyatt
Tennessee Valley Authority
TRO Configuration Management System Admin UNIX/NT
phone: 423-751-4189
e-mail: mghy...@tva.gov