RE: Hadoop UI beta

2009-04-22 Thread Patterson, Josh
Stefan,
Thanks for contributing this; it's very nice. We may try to use
the Hadoop UI (web server part) as an XML data source to feed a web app
showing users the state of their jobs, since this seems like a good, simple
web server to customize for pulling job info to another server or via
AJAX. Thanks!

Josh Patterson
TVA 

-Original Message-
From: Stefan Podkowinski [mailto:spo...@gmail.com] 
Sent: Tuesday, March 31, 2009 7:12 AM
To: core-user@hadoop.apache.org
Subject: ANN: Hadoop UI beta

Hello,

I'd like to invite you to take a look at the recently released first
beta of Hadoop UI, a graphical Flex/Java based client for Hadoop Core.
Hadoop UI currently includes an HDFS file explorer and basic job
tracking features.

Get it here:
http://code.google.com/p/hadoop-ui/

As this is the first release it may (and does) still contain bugs, but
I'd like to give everyone the chance to send feedback as early as
possible.
Give it a try :)

- Stefan


RE: Hadoop and Matlab

2009-04-21 Thread Patterson, Josh
Sameer,
I'd be interested in that as well. We are constructing a Hadoop
cluster for energy data (PMU) for NERC, and we will potentially be
running jobs for a number of groups and researchers. I know some
researchers will know nothing of MapReduce, yet are very keen on
Matlab, so we're looking at ways to make that transition as smooth as
possible.

Josh Patterson
TVA

-Original Message-
From: Sameer Tilak [mailto:sameer.u...@gmail.com] 
Sent: Tuesday, April 21, 2009 1:56 PM
To: core-user@hadoop.apache.org
Subject: Hadoop and Matlab

Hi there,

We're working on an image analysis project. The image processing code is
written in Matlab. If I invoke that code from a shell script and then
use
that shell script within Hadoop streaming, will that work? Has anyone
done
something along these lines?

Many thanks,
--ST.


Map Rendering

2009-04-13 Thread Patterson, Josh
We're looking into power grid visualization and were wondering if anyone
could recommend a good native Java library (that plays nice with Hadoop) to
render some layers of geospatial data. At this point we have the cluster
crunching our test data, formats, and data structures, and we're now
looking at producing indexes and visualizations. We'd like to be able to
watch the power grid over time (with a 'time slider') over the map and
load the tiles in OpenLayers, OpenStreetMap, or VirtualEarth, so the
engineers could go back and replay large amounts of high-resolution PMU
smart grid data, zooming in and out and using the time slider to step
through it. So essentially we'll need to render the grid graph as a layer
in tiles, and then render each tile (at each zoom level) through time. I'm
hoping someone has done some work with Hadoop and map tile generation and
can save me some time in finding the right Java library. Any suggestions?
 
Josh Patterson
TVA


RE: Small Test Data Sets

2009-03-25 Thread Patterson, Josh
You are exactly right: there was a secondary constructor in my Reader
class that was not setting its split start and length correctly, so each
one was just reading the whole file. I missed a silly one; thanks for
the heads up!
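
In case it helps anyone else hitting the same symptom, the shape of the
fix looks roughly like this. This is only a sketch against the 0.19-era
org.apache.hadoop.mapred API; the class name, field names, and 10-byte
record layout are illustrative, not our actual Reader:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.RecordReader;

// Sketch only -- names and record layout are illustrative.
public class PointRecordReader implements RecordReader<LongWritable, BytesWritable> {
  private static final int POINT_SIZE = 10;   // one data point = 10 bytes
  private final FSDataInputStream in;
  private final long start, end;
  private long pos;

  public PointRecordReader(Configuration conf, FileSplit split) throws IOException {
    FileSystem fs = split.getPath().getFileSystem(conf);
    in = fs.open(split.getPath());
    start = split.getStart();                 // NOT 0 -- this was the bug
    end = start + split.getLength();          // NOT the file length
    in.seek(start);
    pos = start;
  }

  public boolean next(LongWritable key, BytesWritable value) throws IOException {
    if (pos + POINT_SIZE > end) return false; // stay inside this split's byte range
    byte[] point = new byte[POINT_SIZE];
    in.readFully(point);
    key.set(pos);
    value.set(point, 0, POINT_SIZE);
    pos += POINT_SIZE;
    return true;
  }

  public LongWritable createKey() { return new LongWritable(); }
  public BytesWritable createValue() { return new BytesWritable(); }
  public long getPos() { return pos; }
  public float getProgress() {
    return end == start ? 1.0f : (pos - start) / (float) (end - start);
  }
  public void close() throws IOException { in.close(); }
}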

Josh Patterson
TVA 

-Original Message-
From: Enis Soztutar [mailto:enis@gmail.com] 
Sent: Wednesday, March 25, 2009 5:27 AM
To: core-user@hadoop.apache.org
Subject: Re: Small Test Data Sets

Patterson, Josh wrote:
 I want to confirm something with the list that I'm seeing;

 I needed to confirm that my Reader was reading our file format
 correctly, so I created a MR job that simply output each K/V pair to the
 reducer, which then just wrote out each one to the output file. This
 allows me to check by hand that all K/V points of data from our file
 format are getting pulled out of the file correctly. I have set up our
 InputFormat, RecordReader, and Reader subclasses for our specific file
 format.

 While running some basic tests on a small (1meg) single file I noticed
 something odd --- I was getting 2 copies of each data point in the
 output file. Initially I thought my Reader was just somehow reading the
 data point and not moving the read head, but I verified that was not the
 case through a series of tests.

 I then went on to reason that since I had 2 mappers by default on my
 job, and only 1 input file, that each mapper must be reading the file
 independently. I then set the -m flag to 1, and I got the proper output;
 Is it safe to assume in testing on a file that is smaller than the block
 size that I should always use -m 1 in order to get proper block-mapper
 mapping? Also, should I assume that if you have more mappers than disk
 blocks involved that you will get duplicate values? I may have set
 something wrong, I just wanted to check. Thanks

 Josh Patterson
 TVA
If you have developed your own InputFormat, then the problem might be
there. The job of the InputFormat is to create input splits and readers.
For one file and two mappers, the input format should return two splits,
each representing half of the file. In your case, I assume you return two
splits, each containing the whole file. Is this the case?
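
For illustration, roughly what that looks like in code -- a sketch
assuming the 0.19-era org.apache.hadoop.mapred API (the class name is
made up, and getRecordReader() is left as a stub):

import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.*;

// Sketch only: returns non-overlapping byte ranges for a single input file.
public class HalvingInputFormat extends FileInputFormat<LongWritable, BytesWritable> {

  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    Path path = FileInputFormat.getInputPaths(job)[0];   // single input file assumed
    long len = path.getFileSystem(job).getFileStatus(path).getLen();
    int n = Math.max(numSplits, 1);
    long chunk = (len + n - 1) / n;                      // e.g. 2 splits -> two halves
    ArrayList<FileSplit> splits = new ArrayList<FileSplit>();
    for (long off = 0; off < len; off += chunk) {
      // Each split covers a distinct range; the buggy case returns (0, len)
      // for every split, so every mapper re-reads the whole file.
      splits.add(new FileSplit(path, off, Math.min(chunk, len - off), (String[]) null));
    }
    return splits.toArray(new InputSplit[splits.size()]);
  }

  public RecordReader<LongWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    throw new UnsupportedOperationException("plug in your own reader here");
  }
}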

Enis


Small Test Data Sets

2009-03-24 Thread Patterson, Josh
I want to confirm something with the list that I'm seeing;
 
I needed to confirm that my Reader was reading our file format
correctly, so I created a MR job that simply output each K/V pair to the
reducer, which then just wrote out each one to the output file. This
allows me to check by hand that all K/V points of data from our file
format are getting pulled out of the file correctly. I have set up our
InputFormat, RecordReader, and Reader subclasses for our specific file
format.
 
While running some basic tests on a small (1meg) single file I noticed
something odd --- I was getting 2 copies of each data point in the
output file. Initially I thought my Reader was just somehow reading the
data point and not moving the read head, but I verified that was not the
case through a series of tests.
 
I then went on to reason that since I had 2 mappers by default on my
job, and only 1 input file, that each mapper must be reading the file
independently. I then set the -m flag to 1, and I got the proper output;
Is it safe to assume in testing on a file that is smaller than the block
size that I should always use -m 1 in order to get proper block-mapper
mapping? Also, should I assume that if you have more mappers than disk
blocks involved that you will get duplicate values? I may have set
something wrong, I just wanted to check. Thanks
 
Josh Patterson
TVA
 


RE: RecordReader design heuristic

2009-03-18 Thread Patterson, Josh
Jeff,
Yeah, the mapper sitting on a dfs block is pretty cool.

Also, yes, we are about to start crunching on a lot of energy smart grid
data. TVA is sorta like Switzerland for smart grid power generation
and transmission data across the nation. Right now we have about 12TB,
and this is slated to be around 30TB by the end of 2010 (possibly
more, depending on how many more PMUs come online). I am very interested
in Mahout and have read up on it; it has many algorithms that I am
familiar with from grad school. I will be doing some very simple MR jobs
early on, like finding the average frequency for a range of data, and
I've been selling various groups internally on what CAN be done with
good data mining and tools like Hadoop/Mahout. Our production cluster
won't be online for a few more weeks, but that part is already rolling, so
I've moved on to focus on designing the first jobs to find quality
results/benefits that I can sell in order to campaign for the more
ambitious projects I have drawn up. I know time series data lends itself
to many machine learning applications, so, yes, I would be very
interested in talking to anyone who wants to talk or share notes on
Hadoop and machine learning. I believe Mahout can be a tremendous
resource for us and definitely plan on running and contributing to it.

Josh Patterson
TVA

-Original Message-
From: Jeff Eastman [mailto:j...@windwardsolutions.com] 
Sent: Wednesday, March 18, 2009 12:02 PM
To: core-user@hadoop.apache.org
Subject: Re: RecordReader design heuristic

Hi Josh,
It seemed like you had a conceptual wire crossed and I'm glad to help 
out. The neat thing about Hadoop mappers is - since they are given a 
replicated HDFS block to munch on - the job scheduler has replication 
factor number of node choices where it can run each mapper. This means 
mappers are always reading from local storage.

On another note, I notice you are processing what looks to be large 
quantities of vector data. If you have any interest in clustering this 
data you might want to look at the Mahout project 
(http://lucene.apache.org/mahout/). We have a number of Hadoop-ready 
clustering algorithms, including a new non-parametric Dirichlet Process 
Clustering implementation that I committed recently. We are pulling it 
all together for a 0.1 release and I would be very interested in helping
you to apply these algorithms if you have an interest.

Jeff


Patterson, Josh wrote:
 Jeff,
 ok, that makes more sense. I was under the misimpression that it was
 creating and destroying mappers for each input record; I don't know why I
 had that in my head. My design suddenly became a lot clearer, and this
 provides a much cleaner abstraction. Thanks for your help!

 Josh Patterson
 TVA



RE: RecordReader design heuristic

2009-03-18 Thread Patterson, Josh
Hi Tom,
Yeah, I'm assuming the splits are going to be about a single DFS block
size (64M here). Each file I'm working with is around 1.5GB in size, and
has a sort of File Allocation Table at the very end which tells you the
block sizes inside the file, and then some other info. Once I pull that
info out of the tail end of the file, I can calculate which internal
blocks lie inside the split's byte range, pull those out, and push the
individual data points up to the mapper, as well as deal with any block
that falls over the split range (I'm assuming right now I'll use the
same idea as the line-oriented reader and just read all blocks that
fall over the end point of the split, unless it's the first split
section). I guess the only hit I'm going to take here is having to ask
the DFS for a quick read into the last 16 bytes of the whole file, where
my file info is stored. Splitting this file format doesn't seem to be so
bad; it's just a matter of finding which multiples of the internal block
size fit inside the split range and getting that multiple factor
beforehand.
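
Roughly what I have in mind, as a sketch only (the 16-byte trailer layout
and decodeBlockSize() below are placeholders since the real format isn't
public; 0.19-era mapred API assumed):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.FileSplit;

// Sketch only: trailer layout and helper are placeholders for the real format.
public class TrailerAlignment {

  // Returns the file offset of the first internal block this split should read.
  static long alignedStart(Configuration conf, FileSplit split) throws IOException {
    FileSystem fs = split.getPath().getFileSystem(conf);
    long fileLen = fs.getFileStatus(split.getPath()).getLen();

    // The one extra hit: a short positioned read against HDFS for the tail table.
    byte[] trailer = new byte[16];
    FSDataInputStream in = fs.open(split.getPath());
    try {
      in.readFully(fileLen - 16, trailer, 0, 16);
    } finally {
      in.close();
    }
    long blockSize = decodeBlockSize(trailer);   // hypothetical helper

    // A block straddling the start belongs to the previous split; like the
    // line-oriented reader, a block straddling the end is read by this split.
    long splitStart = split.getStart();
    return ((splitStart + blockSize - 1) / blockSize) * blockSize;
  }

  private static long decodeBlockSize(byte[] trailer) {
    // Placeholder: real code would parse the format's trailer here.
    return 8190;
  }
}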

After I get some of the mechanics of the process down and I show the team
some valid results, I may be able to talk them into going to another
format that works better with MR. If anyone has any ideas on what file
formats work best for storing and processing large amounts of time series
points with MR, I'm all ears. We're moving towards a new philosophy wrt
big data, so it's a good time for us to examine best practices going
forward.

Josh Patterson
TVA

-Original Message-
From: Tom White [mailto:t...@cloudera.com] 
Sent: Wednesday, March 18, 2009 1:21 PM
To: core-user@hadoop.apache.org
Subject: Re: RecordReader design heuristic

Hi Josh,

The other aspect to think about when writing your own record reader is
input splits. As Jeff mentioned you really want mappers to be
processing about one HDFS block's worth of data. If your inputs are
significantly smaller, the overhead of creating mappers will be high
and your jobs will be inefficient. On the other hand, if your inputs
are significantly larger then you need to split them otherwise each
mapper will take a very long time processing each split. Some file
formats are inherently splittable, meaning you can re-align with
record boundaries from an arbitrary point in the file. Examples
include line-oriented text (split at newlines), and bzip2 (has a
unique block marker). If your format is splittable then you will be
able to take advantage of this to make MR processing more efficient.
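
And if a format turns out not to be safely splittable, the fallback is to
tell the framework so. A sketch against the 0.19-era mapred API (the class
name is made up, and getRecordReader() is left as a stub):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.*;

// Sketch: declaring a format non-splittable so each file becomes exactly one split.
// (PmuInputFormat is an illustrative name, not an existing class.)
public class PmuInputFormat extends FileInputFormat<LongWritable, BytesWritable> {

  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;   // whole file goes to one mapper: safe, but less parallelism
  }

  public RecordReader<LongWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    throw new UnsupportedOperationException("return the format's RecordReader here");
  }
}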

Cheers,
Tom

On Wed, Mar 18, 2009 at 5:00 PM, Patterson, Josh jpatters...@tva.gov
wrote:
 Jeff,
 Yeah, the mapper sitting on a dfs block is pretty cool.

 Also, yes, we are about to start crunching on a lot of energy smart grid
 data. TVA is sorta like Switzerland for smart grid power generation
 and transmission data across the nation. Right now we have about 12TB,
 and this is slated to be around 30TB by the end of 2010 (possibly
 more, depending on how many more PMUs come online). I am very interested
 in Mahout and have read up on it; it has many algorithms that I am
 familiar with from grad school. I will be doing some very simple MR jobs
 early on, like finding the average frequency for a range of data, and
 I've been selling various groups internally on what CAN be done with
 good data mining and tools like Hadoop/Mahout. Our production cluster
 won't be online for a few more weeks, but that part is already rolling, so
 I've moved on to focus on designing the first jobs to find quality
 results/benefits that I can sell in order to campaign for the more
 ambitious projects I have drawn up. I know time series data lends itself
 to many machine learning applications, so, yes, I would be very
 interested in talking to anyone who wants to talk or share notes on
 Hadoop and machine learning. I believe Mahout can be a tremendous
 resource for us and definitely plan on running and contributing to it.

 Josh Patterson
 TVA

 -Original Message-
 From: Jeff Eastman [mailto:j...@windwardsolutions.com]
 Sent: Wednesday, March 18, 2009 12:02 PM
 To: core-user@hadoop.apache.org
 Subject: Re: RecordReader design heuristic

 Hi Josh,
 It seemed like you had a conceptual wire crossed and I'm glad to help
 out. The neat thing about Hadoop mappers is - since they are given a
 replicated HDFS block to munch on - the job scheduler has replication
 factor number of node choices where it can run each mapper. This means
 mappers are always reading from local storage.

 On another note, I notice you are processing what looks to be large
 quantities of vector data. If you have any interest in clustering this
 data you might want to look at the Mahout project
 (http://lucene.apache.org/mahout/). We have a number of Hadoop-ready
 clustering algorithms, including a new non-parametric Dirichlet Process
 Clustering implementation that I committed recently. We are pulling it
 all together for a 0.1 release and I would be very interested

RecordReader design heuristic

2009-03-17 Thread Patterson, Josh
I am currently working on a RecordReader to read a custom time series
data binary file format and was wondering about ways to be most
efficient in designing the InputFormat/RecordReader process. Reading
through:
 
http://wiki.apache.org/hadoop/HadoopMapReduce
 
gave me a lot of hints about how the various classes work together in
order to read any type of file. I was looking at how the TextInputFormat
uses the LineRecordReader in order to send individual lines to each
mapper. My question is, what is a good heuristic in how to choose how
much data to send to each mapper? With the stock LineRecordReader each
mapper only gets to work with a single line which leads me to believe
that we want to give each mapper very little work. Currently I'm looking
at either sending each mapper a single point of data (10 bytes), which
seems small, or sending a single mapper a block of data (around 819
points, at 10 bytes each, --- 8190 bytes). I'm leaning towards sending
the block to the mapper.
 
These factors are based around dealing with a legacy file format (for
now) so I'm just trying to make the best tradeoff possible for the short
term until I get some basic stuff rolling, at which point I can suggest
a better storage format, or just start converting the groups of stored
points into a format more fitting for the platform. I understand that
the InputFormat is not really trying to make much meaning out of the
data, other than to assist in getting the correct data out of the
file based on the file split variables. Another question I have is: with
a pretty much stock install, generally how big is each FileSplit?
 
Josh Patterson
TVA


RE: RecordReader design heuristic

2009-03-17 Thread Patterson, Josh
Jeff,
So if I'm hearing you right, it's good to send one point of data (10
bytes here) to a single mapper? This mindset increases the number of
mappers, but keeps their logic scaled down to simply "look at this
record and emit/don't emit", which is considered more favorable? I'm
still getting the hang of the MR design tradeoffs; thanks for your
feedback.

Josh Patterson
TVA

-Original Message-
From: Jeff Eastman [mailto:j...@windwardsolutions.com] 
Sent: Tuesday, March 17, 2009 5:11 PM
To: core-user@hadoop.apache.org
Subject: Re: RecordReader design heuristic

If you send a single point to the mapper, your mapper logic will be 
clean and simple. Otherwise you will need to loop over your block of 
points in the mapper. In Mahout clustering, I send the mapper individual

points because the input file is point-per-line. In either case, the 
record reader will be iterating over a block of data to provide mapper 
inputs. IIRC, splits will generally be an HDFS block or less, so if you
have files smaller than that you will get one mapper per file. For larger
files you can get up to one mapper per block-sized split.
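
IIRC the stock sizing works out roughly like this; this is paraphrased
from memory of the 0.19-era FileInputFormat logic (check the source for
your version), not the actual Hadoop code:

// Sketch of how the default split size is chosen (from memory; verify
// against your version's FileInputFormat.getSplits()).
static long stockSplitSize(long totalInputSize, int numMapTasks,
                           long minSplitSize, long dfsBlockSize) {
  long goalSize = totalInputSize / Math.max(numMapTasks, 1);   // -m is only a hint
  return Math.max(minSplitSize, Math.min(goalSize, dfsBlockSize));
}
// With the defaults (minSplitSize = 1) and files larger than a block, this
// comes out to the HDFS block size, i.e. roughly one split per block.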

Jeff

Patterson, Josh wrote:
 I am currently working on a RecordReader to read a custom time series
 data binary file format and was wondering about ways to be most
 efficient in designing the InputFormat/RecordReader process. Reading
 through:

 http://wiki.apache.org/hadoop/HadoopMapReduce

 gave me a lot of hints about how the various classes work together in
 order to read any type of file. I was looking at how the TextInputFormat
 uses the LineRecordReader in order to send individual lines to each
 mapper. My question is: what is a good heuristic for choosing how
 much data to send to each mapper? With the stock LineRecordReader, each
 mapper only gets to work with a single line, which leads me to believe
 that we want to give each mapper very little work. Currently I'm looking
 at either sending each mapper a single point of data (10 bytes), which
 seems small, or sending a single mapper a block of data (around 819
 points at 10 bytes each, about 8190 bytes). I'm leaning towards sending
 the block to the mapper.

 These factors are based around dealing with a legacy file format (for
 now) so I'm just trying to make the best tradeoff possible for the short
 term until I get some basic stuff rolling, at which point I can suggest
 a better storage format, or just start converting the groups of stored
 points into a format more fitting for the platform. I understand that
 the InputFormat is not really trying to make much meaning out of the
 data, other than to assist in getting the correct data out of the
 file based on the file split variables. Another question I have is: with
 a pretty much stock install, generally how big is each FileSplit?

 Josh Patterson
 TVA



RE: RecordReader design heuristic

2009-03-17 Thread Patterson, Josh
Jeff,
ok, that makes more sense. I was under the misimpression that it was creating
and destroying mappers for each input record; I don't know why I had that in
my head. My design suddenly became a lot clearer, and this provides a much
cleaner abstraction. Thanks for your help!

Josh Patterson
TVA


-Original Message-
From: Jeff Eastman [mailto:j...@windwardsolutions.com]
Sent: Tue 03/17/2009 6:02 PM
To: core-user@hadoop.apache.org
Subject: Re: RecordReader design heuristic
 
Hi Josh,

Well, I don't really see how you will get more mappers, just simpler 
logic in the mapper. The number of mappers is driven by how many input 
files you have and their sizes and not by any chunking you do in the 
record reader. Each record reader will get an entire split and will feed 
it to its mapper in a stream one record at a time. You can duplicate 
some of that logic in the mapper if you want but you already will have 
it in the reader so why bother?
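
To make that concrete, here is about as much mapper code as you end up
writing when the reader hands over one point per call. A sketch against
the 0.19-era mapred API; the class name, types, and isInteresting() filter
are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch only. With a point-per-record reader the map() body stays this
// trivial; with a block-per-record reader the iteration over points would
// have to be repeated here in every mapper instead of once in the reader.
public class PointFilterMapper extends MapReduceBase
    implements Mapper<LongWritable, BytesWritable, LongWritable, BytesWritable> {

  public void map(LongWritable offset, BytesWritable point,
                  OutputCollector<LongWritable, BytesWritable> out,
                  Reporter reporter) throws IOException {
    // "look at this record and emit / don't emit"
    if (isInteresting(point)) {
      out.collect(offset, point);
    }
  }

  private static boolean isInteresting(BytesWritable point) {
    return point.getLength() == 10;   // placeholder test -- real logic goes here
  }
}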

Jeff


Patterson, Josh wrote:
 Jeff,
 So if I'm hearing you right, it's good to send one point of data (10
 bytes here) to a single mapper? This mindset increases the number of
 mappers, but keeps their logic scaled down to simply "look at this
 record and emit/don't emit", which is considered more favorable? I'm
 still getting the hang of the MR design tradeoffs; thanks for your
 feedback.

 Josh Patterson
 TVA

 -Original Message-
 From: Jeff Eastman [mailto:j...@windwardsolutions.com]
 Sent: Tuesday, March 17, 2009 5:11 PM
 To: core-user@hadoop.apache.org
 Subject: Re: RecordReader design heuristic

 If you send a single point to the mapper, your mapper logic will be
 clean and simple. Otherwise you will need to loop over your block of
 points in the mapper. In Mahout clustering, I send the mapper individual
 points because the input file is point-per-line. In either case, the
 record reader will be iterating over a block of data to provide mapper
 inputs. IIRC, splits will generally be an HDFS block or less, so if you
 have files smaller than that you will get one mapper per file. For larger
 files you can get up to one mapper per block-sized split.

 Jeff

 Patterson, Josh wrote:
  I am currently working on a RecordReader to read a custom time series
  data binary file format and was wondering about ways to be most
  efficient in designing the InputFormat/RecordReader process. Reading
  through:

  http://wiki.apache.org/hadoop/HadoopMapReduce

  gave me a lot of hints about how the various classes work together in
  order to read any type of file. I was looking at how the TextInputFormat
  uses the LineRecordReader in order to send individual lines to each
  mapper. My question is: what is a good heuristic for choosing how
  much data to send to each mapper? With the stock LineRecordReader, each
  mapper only gets to work with a single line, which leads me to believe
  that we want to give each mapper very little work. Currently I'm looking
  at either sending each mapper a single point of data (10 bytes), which
  seems small, or sending a single mapper a block of data (around 819
  points at 10 bytes each, about 8190 bytes). I'm leaning towards sending
  the block to the mapper.

  These factors are based around dealing with a legacy file format (for
  now) so I'm just trying to make the best tradeoff possible for the short
  term until I get some basic stuff rolling, at which point I can suggest
  a better storage format, or just start converting the groups of stored
  points into a format more fitting for the platform. I understand that
  the InputFormat is not really trying to make much meaning out of the
  data, other than to assist in getting the correct data out of the
  file based on the file split variables. Another question I have is: with
  a pretty much stock install, generally how big is each FileSplit?

  Josh Patterson
  TVA


RE: Issues installing FUSE_DFS

2009-03-03 Thread Patterson, Josh
Brian,
Do you know of anyone using Samba to access the FUSE-DFS mount point via
Windows? We have FUSE-DFS working, but read/write doesn't work via
Samba.

Josh Patterson 

-Original Message-
From: Brian Bockelman [mailto:bbock...@cse.unl.edu] 
Sent: Tuesday, March 03, 2009 11:26 AM
To: core-user@hadoop.apache.org
Subject: Re: Issues installing FUSE_DFS


On Mar 3, 2009, at 10:01 AM, Patterson, Josh wrote:

 Hey Brian,
 I'm working with Matthew on our HDFS install, and he's doing the server
 admin on this project. We just tried the settings you suggested, and we
 got the following error:




 [r...@socdvmhdfs1 ~]# fuse_dfs -oserver=socdvmhdfs1 -oport=9000 /hdfs -oallow_other -ordbuffer=131072
 fuse-dfs didn't recognize /hdfs,-2
 fuse-dfs ignoring option allow_other
 [r...@socdvmhdfs1 ~]# df -h
 FilesystemSize  Used Avail Use% Mounted on
 /dev/mapper/VolGroup00-LogVol00
  5.8G  3.7G  1.9G  67% /
 /dev/sda1  99M   18M   77M  19% /boot
 tmpfs 506M 0  506M   0% /dev/shm
 df: `/hdfs': Input/output error


So, you have a successful mount, but for some reason it's not talking  
to Hadoop.

Try adding the -d flag; this will mount FUSE in debug mode, and any  
Hadoop errors (or whatever is causing the Input/output error) will be  
printed to the terminal.

Brian



 We are using Red Hat EL5 and Hadoop 0.19; we did have some trouble
 compiling FUSE-DFS but got through the compilation errors. Any advice on
 what to try next?

 Josh Patterson
 TVA



 -Original Message-
 From: Brian Bockelman [mailto:bbock...@cse.unl.edu]
 Sent: Monday, March 02, 2009 5:30 PM
 To: core-user@hadoop.apache.org
 Subject: Re: Issues installing FUSE_DFS

 Hey Matthew,

 We use the following command on 0.19.0:

 fuse_dfs -oserver=hadoop-name -oport=9000 /mnt/hadoop -oallow_other -ordbuffer=131072

 Brian

 On Mar 2, 2009, at 4:12 PM, Hyatt, Matthew G wrote:

 When we try to mount the DFS from FUSE we are getting the following
 errors. Has anyone seen these issues in the past? This is on version
 0.19.0.


 [r...@socdvmhdfs1]# fuse_dfs dfs://socdvmhdfs1:9000 /hdfs
 port=9000,server=socdvmhdfs1
 fuse-dfs didn't recognize /hdfs,-2

 [r...@socdvmhdfs1]# df -h
 FilesystemSize  Used Avail Use% Mounted on
 /dev/mapper/VolGroup00-LogVol00
 5.8G  3.7G  1.9G  67% /
 /dev/sda1  99M   18M   77M  19% /boot
 tmpfs 506M 0  506M   0% /dev/shm
 df: `/hdfs': Input/output error


 Matthew G. Hyatt
 Tennessee Valley Authority
 TRO Configuration Management
 System Admin UNIX/NT
 phone: 423-751-4189
 e-mail: mghy...@tva.gov