Re: why doesn't HDFS read ahead?

2009-11-25 Thread Steve Loughran
Michael Thomas wrote: Hey guys, during the SC09 exercise our data transfer tool was using the FUSE interface to HDFS. As Brian said, we were also reading 16 files in parallel. This seemed to be the optimal number, beyond which the aggregate read rate did not improve. We have worked sched...

native s3 file system - file split on S3 native file system and custom filesystem store

2009-11-25 Thread neova...@gmail.com
For security reasons I am required to use a different S3 library, which I have been provided, to access S3 data. If I write an adapter against the native file system store class to access S3 using my own library, do I still get the same benefits that I would get from using the default file system...
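
A minimal sketch of the wiring involved, assuming a hypothetical adapter class com.example.MyS3FileSystem that delegates to the in-house S3 library (the class and scheme names are made up, not from the thread); Hadoop resolves a URI scheme to an implementation through the fs.<scheme>.impl configuration key:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CustomS3Demo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Map the (hypothetical) "mys3" scheme to the custom adapter class.
            conf.set("fs.mys3.impl", "com.example.MyS3FileSystem");
            FileSystem fs = FileSystem.get(new URI("mys3://bucket/"), conf);
            System.out.println(fs.exists(new Path("mys3://bucket/data.txt")));
        }
    }

As long as the adapter honors the FileSystem contract, the rest of the stack (MapReduce input/output, distcp, and so on) treats it like any other file system.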

Re: Master and slaves on hadoop/ec2

2009-11-25 Thread Tom White
Correct. The master runs the namenode and jobtracker, but not a datanode or tasktracker. Tom On Tue, Nov 24, 2009 at 4:57 PM, Mark Kerzner wrote: > Hi, > > do I understand it correctly that, when I launch a Hadoop cluster on EC2, > the master will not be doing any work, and it is just for organi...

Re: How do I reference S3 from an EC2 Hadoop cluster?

2009-11-25 Thread Tom White
On Tue, Nov 24, 2009 at 9:27 PM, Mark Kerzner wrote: > Yes, Tom, I saw all these problems. I think that I should stop trying to > imitate EMR (that's where the idea of storing data on S3 came from) and > transfer data directly to the Hadoop cluster. Then I will be using everything as intended. > > Is there a way...

Re: Using hadoop for Matrix Multiplication in NFS?

2009-11-25 Thread Edward J. Yoon
Just FYI, Hadoop M/R is a distributed computing system, so there is a problem of locality and placement of sub-matrix blocks. Moreover, the M/R iteration method is really slow. To perform matrix multiplication (and also graph algorithms) on Hadoop, the Apache Hama team is considering a BSP (bulk synchronous parallel)...

Re: Using hadoop for Matrix Multiplication in NFS?

2009-11-25 Thread Tsz Wo (Nicholas), Sze
Hi Gimich, Could you describe your matrix multiplication problem in more detail? Are the matrices sparse or dense? How big is the on-disk size of a matrix? Thanks. Nicholas Sze - Original Message > From: Edward J. Yoon > To: common-user@hadoop.apache.org > Sent: Tue, November 24...

time outs when accessing port 50010

2009-11-25 Thread David J. O'Dell
I have 2 clusters: 30 nodes running 0.18.3 and 36 nodes running 0.20.1. I've intermittently seen the following errors on both of my clusters; it happens when writing files. I was hoping this would go away with the new version, but I see the same behavior on both versions. The namenode logs don't...

EC2 Hadoop 0.19 image is only 3 Gigs harddrive - too small

2009-11-25 Thread Mark Kerzner
Hi, I have started the Apache distribution of hadoop-0.19, and I noticed that this Fedora image has only a 3 GB hard drive. Why is it so small? It was 50% full on startup, and with me adding a few things it is now at 80%. Is there anything I can do to increase this? Otherwise I would not have much room to work in. Thank you...

Re: time outs when accessing port 50010

2009-11-25 Thread Raghu Angadi
You could check what's going on on the second datanode (10.1.75.104); the connect to it from the first datanode took longer than 2 min. Btw, it looks to be a bug that there is a mismatch between the timeouts used by the datanode and the client. Ideally the client's timeout should be larger than the timeout used by the datanode...
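
As a hedged illustration of the knobs involved (property names from the 0.18/0.20-era HDFS configuration; the values here are arbitrary), both timeouts can be raised together so the client side never waits less than the datanode side:

    import org.apache.hadoop.conf.Configuration;

    public class TimeoutTuning {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Socket read timeout used by DFS clients and datanodes, in ms.
            conf.setInt("dfs.socket.timeout", 180000);
            // Write-side timeout used by datanodes when streaming blocks, in ms.
            conf.setInt("dfs.datanode.socket.write.timeout", 480000);
            System.out.println("read timeout = " + conf.getInt("dfs.socket.timeout", 0));
        }
    }

The same properties can of course go into hadoop-site.xml instead of being set programmatically.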

Re: EC2 Hadoop 0.19 image is only 3 Gigs harddrive - too small

2009-11-25 Thread Tom White
Hi Mark, The root partition is small, but there is plenty of storage on the /mnt partition. See http://aws.amazon.com/ec2/instance-types/. Cheers, Tom On Wed, Nov 25, 2009 at 12:30 PM, Mark Kerzner wrote: > Hi, > > I have started the Apache distribution of hadoop-0.19, and I noticed that > this...
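
A sketch of one way to act on this, assuming the standard Hadoop EC2 AMI layout where the instance store is mounted at /mnt (an assumption, not something stated in the thread): point hadoop.tmp.dir, which most other scratch and DFS directories default under, at the large partition.

    import org.apache.hadoop.conf.Configuration;

    public class TmpDirOnMnt {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Keep bulky working data off the small root partition.
            conf.set("hadoop.tmp.dir", "/mnt/hadoop");
            System.out.println(conf.get("hadoop.tmp.dir"));
        }
    }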

Re: EC2 Hadoop 0.19 image is only 3 Gigs harddrive - too small

2009-11-25 Thread Mark Kerzner
Tom, Can I install applications on /mnt? Are they saved on reboot and on image bundling? If not, I will have to be extra careful about space, but I can use /mnt as my working area. That is still something, so thank you. Mark On Wed, Nov 25, 2009 at 2:48 PM, Tom White wrote: > Hi Mark, > > The root part...

Re: EC2 Hadoop 0.19 image is only 3 Gigs harddrive - too small

2009-11-25 Thread Mark Kerzner
Oh, and all instances at http://aws.amazon.com/ec2/instance-types/ have a 10 GB root partition, so where did it go? I guess somebody wanted an extra small root. On Wed, Nov 25, 2009 at 2:48 PM, Tom White wrote: > Hi Mark, > > The root partition is small, but there is plenty of storage on the > /mnt...

part-00000.deflate as output

2009-11-25 Thread Mark Kerzner
Hi, I get part-00000.deflate as output instead of part-00000. How do I get rid of the deflate option? Thank you, Mark

The name of the current input file during a map

2009-11-25 Thread Saptarshi Guha
Hello, I have a set of input files part-r-* which I will pass through another map (no reduce). The part-r-* files consist of key/value pairs, the keys being small and the values fairly large (MBs). I would like to index these, i.e. run a map whose output is key and /filename/, i.e. to which part-r-* file the particular...

Re: part-00000.deflate as output

2009-11-25 Thread Amogh Vasekar
Hi, ".deflate" is the default compression codec used when parameter to generate compressed output is true ( mapred.output.compress ). You may set the codec to be used via mapred.output.compression.codec, some commonly used are available in hadoop.io.compress package... Amogh On 11/26/09 11:03

Re: part-00000.deflate as output

2009-11-25 Thread Tim Kiefer
For testing purposes you can also try to disable the compression: conf.setBoolean("mapred.output.compress", false); Then you can look at the output. - tim Amogh Vasekar wrote: Hi, ".deflate" is the default compression codec used when the parameter to generate compressed output is true ( mapred.o...

Re: The name of the current input file during a map

2009-11-25 Thread Amogh Vasekar
conf.get("map.input.file") is what you need. Amogh On 11/26/09 12:35 PM, "Saptarshi Guha" wrote: Hello, I have a set of input files part-r-* which I will pass through another map (no reduce). The part-r-* files consist of key/value pairs, keys being small, values fairly large (MBs). I would like to...
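
A minimal sketch of the indexing mapper Saptarshi described, using the old org.apache.hadoop.mapred API, where the framework sets map.input.file per task (KeyValueTextInputFormat as the job's input format is an assumption, not from the thread):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class FileIndexMapper extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {
        private Text fileName;

        @Override
        public void configure(JobConf job) {
            // Set by the framework for each map task.
            fileName = new Text(job.get("map.input.file"));
        }

        public void map(Text key, Text value,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            // Emit key -> the part-r-* file it came from; the large value
            // is dropped, so only the index survives.
            out.collect(key, fileName);
        }
    }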

Re: The name of the current input file during a map

2009-11-25 Thread Saptarshi Guha
Thank you. Regards Saptarshi On Thu, Nov 26, 2009 at 2:10 AM, Amogh Vasekar wrote: > conf.get("map.input.file") is what you need. > > Amogh > > > On 11/26/09 12:35 PM, "Saptarshi Guha" wrote: > > Hello, > I have a set of input files part-r-* which I will pass through another > map (no reduce). The...

Re: The name of the current input file during a map

2009-11-25 Thread Saptarshi Guha
Hello again, I'm using Hadoop 0.21 and its context object, e.g. public void setup(Context context) { Configuration cfg = context.getConfiguration(); System.out.println("mapred.input.file=" + cfg.get("mapred.input.file")); } displays null, so maybe this fell out by mistake in the API change? Regards, Saptarshi
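
A sketch of the usual workaround for the new mapreduce API: recover the file name from the task's InputSplit instead of the configuration (the cast assumes a file-based input format, e.g. SequenceFileInputFormat over the part-r-* files):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class FileIndexMapper extends Mapper<Text, Text, Text, Text> {
        private Text fileName;

        @Override
        protected void setup(Context context) {
            // mapred.input.file is not populated here, but the split knows
            // which file this map task is reading.
            FileSplit split = (FileSplit) context.getInputSplit();
            fileName = new Text(split.getPath().getName());
        }

        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, fileName);
        }
    }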