Re: Cloudera EC2 scripts

2010-05-27 Thread Andrew Nguyen
I didn't have any problems using the scripts that are in CDH3 (beta, March 2010) to bring up and tear down Hadoop cluster instances with EC2. I think there were some differences between the documentation and the actual scripts but it's been a few weeks and I don't have access to my notes right

Re: dfs.name.dir capacity for namenode backup?

2010-05-18 Thread Andrew Nguyen
Sorry to hijack, but after following this thread, I had a related question about the secondary location of dfs.name.dir. Is the approach outlined below the preferred/suggested way to do this? Is this what people mean when they say, "stick it on NFS"? Thanks! On May 17, 2010, at 11:14 PM, Todd Lipco

Re: Setting up a second cluster and getting a weird issue

2010-05-17 Thread Andrew Nguyen
Sorry for bothering everyone. I accidentally configured my dfs.data.dir and mapred.local.dir to the same directory. Bad copy/paste job. Thanks for everyone's help!
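
A check for the mistake described above (dfs.data.dir and mapred.local.dir pointing at the same directory) might look like the following sketch. It assumes both properties have been merged into one Hadoop-style configuration document. The helper names and the mapred.local.dir value in the sample are hypothetical, since the thread does not show the actual configuration. Only /srv/hadoop/dfs/1 comes from the messages.

```python
import posixpath
import xml.etree.ElementTree as ET

# Hypothetical sample configuration: only /srv/hadoop/dfs/1 appears in the
# thread; the mapred.local.dir value is assumed for illustration.
SAMPLE_CONF = """
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/srv/hadoop/dfs/1</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/srv/hadoop/dfs/1/</value>
  </property>
</configuration>
"""

def read_props(xml_text):
    """Return a {name: value} dict from a Hadoop-style <configuration> document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            props[name.strip()] = value.strip()
    return props

def split_dirs(value):
    """Split a comma-separated directory list and normalize each path."""
    return [posixpath.normpath(d.strip()) for d in value.split(",") if d.strip()]

def overlapping_dirs(props, key_a="dfs.data.dir", key_b="mapred.local.dir"):
    """Return the sorted list of directories configured under both keys."""
    dirs_a = set(split_dirs(props.get(key_a, "")))
    dirs_b = set(split_dirs(props.get(key_b, "")))
    return sorted(dirs_a & dirs_b)

if __name__ == "__main__":
    clash = overlapping_dirs(read_props(SAMPLE_CONF))
    if clash:
        print("Shared by dfs.data.dir and mapred.local.dir:", clash)
```

The sketch only catches identical paths after normalization. It does not detect one directory nested inside the other or two paths that resolve to the same NFS export.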

Re: Setting up a second cluster and getting a weird issue

2010-05-17 Thread Andrew Nguyen
re the hadoop files via NFS. >>>> The log and pid directories are local. >>>> >>>> Thanks! >>>> >>>> --Andrew >>>> >>>> On May 12, 2010, at 7:40 PM, Jeff Zhang wrote: >>>> >>

Re: Setting up a second cluster and getting a weird issue

2010-05-14 Thread Andrew Nguyen
at 6:51 PM, Jeff Zhang wrote: >> >>> It is not suggested to deploy hadoop on NFS, there will be conflict >>> between data nodes, because NFS share the same namespace of file >>> system. >>> >>> >>> >>> On Thu

Re: Setting up a second cluster and getting a weird issue

2010-05-14 Thread Andrew Nguyen
14, 2010, at 1:06 PM, Andrew Nguyen wrote: > I'm pretty sure I just set my dfs.data.dir to be /srv/hadoop/dfs/1 > > <property> > <name>dfs.data.dir</name> > <value>/srv/hadoop/dfs/1</value> > </property> > > I don't have hadoop.tmp.dir set to anything so it's whatever the default is. > > I don'

Re: Setting up a second cluster and getting a weird issue

2010-05-14 Thread Andrew Nguyen
red NFS was the fastest way to propagate changes. Thanks! On May 14, 2010, at 9:17 AM, Allen Wittenauer wrote: > > On May 14, 2010, at 8:53 AM, Andrew Nguyen wrote: > >> Just to be clear, I'm only sharing the Hadoop binaries and config files via >> NFS. I don

Re: Setting up a second cluster and getting a weird issue

2010-05-14 Thread Andrew Nguyen
, 2010, at 6:51 PM, Jeff Zhang wrote: > It is not suggested to deploy hadoop on NFS, there will be conflict > between data nodes, because NFS share the same namespace of file > system. > > > > On Thu, May 13, 2010 at 9:52 PM, Andrew Nguyen wrote: >> >> Yes,

Re: Setting up a second cluster and getting a weird issue

2010-05-13 Thread Andrew Nguyen
Yes, in this deployment, I'm attempting to share the hadoop files via NFS. The log and pid directories are local. Thanks! --Andrew On May 12, 2010, at 7:40 PM, Jeff Zhang wrote: > These 4 nodes share NFS ? > > > On Thu, May 13, 2010 at 8:19 AM, Andrew Nguyen > wrot

Setting up a second cluster and getting a weird issue

2010-05-12 Thread Andrew Nguyen
I'm working on bringing up a second test cluster and am getting these intermittent errors on the DataNodes: 2010-05-12 17:17:15,094 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.FileNotFoundException: /srv/hadoop/dfs/1/current/VERSION (No such file or directory) at java

Re: HDF5 and Hadoop

2010-05-03 Thread Andrew Nguyen
Chris, Thanks for the heads up! --Andrew On May 3, 2010, at 10:45 AM, Mattmann, Chris A (388J) wrote: > Hi Andrew, > > There has been some work in the Tika [1] project recently on looking at > NetCDF4 [2] and HDF4/5 [3] and extracting metadata/text content from them. > Though this doesn't di

HDF5 and Hadoop

2010-05-03 Thread Andrew Nguyen
Does anyone know of any existing work integrating HDF5 (http://www.hdfgroup.org/HDF5/whatishdf5.html) with Hadoop? I don't know much about HDF5 but it was recently brought to my attention as a way to store high-density scientific data. Since I've confirmed that having Hadoop dramatically speed

Splitting input for mapper and contiguous data

2010-04-16 Thread Andrew Nguyen
As I may have mentioned, my main goal currently is the processing of physiologic data using hadoop and MR. The steps are: Convert ADC units to physical units (input is , output is Perform a peak detection to detect the systolic blood pressure (input is , output is but the output is only a s

Running TestDFSIO on an EC2 instance

2010-04-14 Thread Andrew Nguyen
And, I'm getting the following errors: 10/04/15 06:00:50 INFO mapred.JobClient: Task Id : attempt_201004150557_0001_m_00_1, Status : FAILED java.io.IOException: Cannot open filename /benchmarks/TestDFSIO/io_data/test_io_0 A bunch show up and then the job fails. Running the job directly on

Per-file block size

2010-04-13 Thread Andrew Nguyen
I thought I saw a way to specify the block size for individual files from the command line using "hadoop dfs -put/copyFromLocal...". However, I can't seem to find the reference anywhere. I see that I can do it via the API but no references to a command-line mechanism. Am I just remembering so
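
One commonly used command-line route, offered here as a sketch rather than as the answer given in the thread, is to pass the block size as a generic `-D` option to the filesystem shell. The snippet below only builds the argument list and does not run it. The function name and the example paths are hypothetical. `dfs.block.size` is the property name used by Hadoop releases of that period (later releases call it `dfs.blocksize`).

```python
def put_with_block_size(local_path, hdfs_path, block_size_bytes):
    """Build a 'hadoop fs -put' command that overrides the block size for one upload.

    Assumes the filesystem shell accepts generic -D options before the
    subcommand, and that dfs.block.size is the property name in use.
    """
    if block_size_bytes <= 0:
        raise ValueError("block size must be a positive number of bytes")
    return [
        "hadoop", "fs",
        "-D", f"dfs.block.size={block_size_bytes}",
        "-put", local_path, hdfs_path,
    ]

if __name__ == "__main__":
    # Hypothetical paths; 128 MB expressed in bytes.
    cmd = put_with_block_size("data.bin", "/user/andrew/data.bin", 128 * 1024 * 1024)
    print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run` on a machine with the Hadoop client installed.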

Re: Optimal setup for a test problem

2010-04-13 Thread Andrew Nguyen
money flow... --Andrew On Tue, 13 Apr 2010 10:29:06 -0700, Todd Lipcon wrote: > On Mon, Apr 12, 2010 at 1:45 PM, Andrew Nguyen < > andrew-lists-had...@ucsfcti.org> wrote: > >> I don't think you can :-). Sorry, they are 100Mbps NIC's... I get >> 95Mbit/sec from one node

Re: Optimal setup for a test problem

2010-04-13 Thread Andrew Nguyen
Correction: they are 100Mbps NICs. iperf shows that we're getting about 95 Mbits/sec from one node to another. On Apr 12, 2010, at 1:05 PM, Andrew Nguyen wrote: > @Todd: > > I do need the sorting behavior, eventually. However, I'll try it with zero > red

Re: Optimal setup for a test problem

2010-04-12 Thread Andrew Nguyen
I guess my question below can be rephrased as, "What are the absolute minimum hardware requirements for me to still see 'better-than-a-single-machine' performance?" Thanks! On Apr 12, 2010, at 1:45 PM, Andrew Nguyen wrote: > I don't think you can :-). Sorry, they

Re: Optimal setup for a test problem

2010-04-12 Thread Andrew Nguyen
I don't think you can :-). Sorry, they are 100Mbps NICs... I get 95Mbit/sec from one node to another with iperf. Should I still be expecting such dismal performance with just 100Mbps? On Apr 12, 2010, at 1:31 PM, Todd Lipcon wrote: > On Mon, Apr 12, 2010 at 1:05 PM, Andrew Nguy
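
To put the 95 Mbit/sec iperf figure in perspective, the arithmetic below estimates how long it takes to move data between two nodes at that rate. It is a rough sketch that assumes decimal megabits (10^6 bits) and ignores protocol overhead, disk speed, and contention. The function name is hypothetical, and the 64 MB figure is the stock HDFS block size of that era, not a value taken from the thread.

```python
def transfer_seconds(size_bytes, link_mbit_per_s):
    """Seconds needed to move size_bytes over a link of link_mbit_per_s (10^6 bits/s)."""
    bytes_per_second = link_mbit_per_s * 1_000_000 / 8
    return size_bytes / bytes_per_second

if __name__ == "__main__":
    rate = 95  # Mbit/sec, as measured with iperf in the message above
    print(f"throughput: {rate * 1_000_000 / 8 / 1_000_000:.3f} MB/s")
    print(f"one 64 MB block: {transfer_seconds(64 * 1024 * 1024, rate):.1f} s")
    print(f"1 GB (10^9 bytes): {transfer_seconds(10**9, rate):.1f} s")
```

At roughly 11.9 MB/s per link, any step that moves data between nodes, such as the shuffle for a sort or reading a non-local block, is bounded by the network rather than by CPU.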

Re: Optimal setup for a test problem

2010-04-12 Thread Andrew Nguyen
0 and you'll end up with a map > only job, which should be significantly faster. > > -Todd > > On Mon, Apr 12, 2010 at 9:43 AM, Andrew Nguyen < > andrew-lists-had...@ucsfcti.org> wrote: > > > Hello, > > > > I recently setup a 5 node clust

Optimal setup for a test problem

2010-04-12 Thread Andrew Nguyen
Hello, I recently set up a 5-node cluster (1 master, 4 slaves) and am looking to use it to process high volumes of patient physiologic data. As an initial exercise to gain a better understanding, I have attempted to run the following problem (which isn't the type of problem that Hadoop was real