how to optimize mapreduce procedure??

2009-03-11 Thread ZhiHong Fu
Hello, I'm writing a program which performs Lucene searches over about 12 index directories, all of them stored in HDFS. It is done like this: 1. We get about 12 index directories through Lucene's index functionality, each of which is about 100M in size. 2. We store these 12 index director

RE: using virtual slave machines

2009-03-11 Thread Karthikeyan V
There is no specific procedure for configuring virtual machine slaves. Make sure the following things are done: 1. All machines' (both VMs' and physical machines') public keys are distributed to every "~/.ssh/authorized_keys" file. 2. The conf/hadoop-site.xml file is similar for all the machi

Re: What happens when you do a ctrl-c on a big dfs -rmr

2009-03-11 Thread 何 永强
Either all files are deleted or no file is deleted at all, depending on how fast you press ctrl-c. The delete command is not executed in your terminal; instead the rmr command is sent to the hadoop namenode and is executed there. On 09-3-12 10:48 AM, "bzheng" wrote: > > I did a ctrl-c immediately aft

Re: How to read output files over HDFS

2009-03-11 Thread lohit
http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample Lohit - Original Message From: Amandeep Khurana To: core-user@hadoop.apache.org Sent: Wednesday, March 11, 2009 9:46:09 PM Subject: Re: How to read output files over HDFS 2 ways that I can think of: 1. Write another MR job witho

about block size

2009-03-11 Thread ChihChun Chu
Hi, I have a question about how to decide the block size. As I understand it, the block size is related to the namenode's heap size (how many blocks can be handled), the total storage capacity of the cluster, the file sizes (which depend on the application, e.g. a 1T log file), the # of replicas, and the performance of mapr
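A minimal sketch of where the block size is actually set, assuming the 0.19-era FileSystem API; the 128 MB figure, the /tmp/example path, and the replication factor of 3 are illustrative values, not recommendations:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Cluster-wide default block size in bytes (64 MB was the 0.19-era default).
    conf.setLong("dfs.block.size", 128L * 1024 * 1024);
    FileSystem fs = FileSystem.get(conf);
    // The block size can also be chosen per file at create time.
    FSDataOutputStream out = fs.create(new Path("/tmp/example"),
        true,                                       // overwrite
        conf.getInt("io.file.buffer.size", 4096),   // io buffer size
        (short) 3,                                  // replication factor
        128L * 1024 * 1024);                        // block size for this file
    out.writeBytes("hello\n");
    out.close();
  }
}
```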

Re: How to read output files over HDFS

2009-03-11 Thread Amandeep Khurana
2 ways that I can think of: 1. Write another MR job without a reducer. The mapper can be made to do whatever logic you want to do. OR 2. Take an instance of DistributedFileSystem class in your java code and use it to read the file from HDFS. Amandeep Khurana Computer Science Graduate Student Un
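A minimal sketch of option 2, assuming the output0, output1, ... directory names from the original question; the loop bound and the line handling are placeholders:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadJobOutputs {
  public static void main(String[] args) throws Exception {
    // FileSystem.get resolves to HDFS via fs.default.name in the config.
    FileSystem fs = FileSystem.get(new Configuration());
    for (int i = 0; i < 3; i++) {                   // output0, output1, output2, ...
      FileStatus[] files = fs.listStatus(new Path("output" + i));
      if (files == null) continue;                  // directory missing
      for (FileStatus status : files) {
        // Reducer (or map-only) outputs are the part-* files.
        if (!status.getPath().getName().startsWith("part-")) continue;
        BufferedReader reader = new BufferedReader(
            new InputStreamReader(fs.open(status.getPath())));
        String line;
        while ((line = reader.readLine()) != null) {
          System.out.println(line);                 // line-by-line processing goes here
        }
        reader.close();
      }
    }
  }
}
```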

How to read output files over HDFS

2009-03-11 Thread Muhammad Arshad
Hi, I am running multiple MapReduce jobs which generate their output in directories named output0, output1, output2, ...etc. Once these jobs complete I want to read the output stored in these files (line by line) using Java code automatically. Kindly tell me how I can do this. I do not want

Re: What happens when you do a ctrl-c on a big dfs -rmr

2009-03-11 Thread lohit
When you issue -rmr on a directory, the namenode gets the directory name and starts deleting files recursively. It adds the blocks belonging to those files to an invalidate list. The NameNode then deletes those blocks lazily. So, yes, it will issue commands to datanodes to delete those blocks, just give it some time
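A sketch of the equivalent recursive delete through the Java API, under the same semantics lohit describes; the /big/dir path is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecursiveDelete {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Equivalent of "hadoop dfs -rmr /big/dir". The namenode drops the
    // directory from its namespace before this returns (so it vanishes from
    // -ls immediately); the blocks go onto the invalidate list and the
    // datanodes delete them lazily afterwards.
    boolean removed = fs.delete(new Path("/big/dir"), true);  // true = recursive
    System.out.println("namespace entry removed: " + removed);
  }
}
```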

What happens when you do a ctrl-c on a big dfs -rmr

2009-03-11 Thread bzheng
I did a ctrl-c immediately after issuing a hadoop dfs -rmr command. The rmr target is no longer visible from the dfs -ls command. The number of files deleted is huge and I don't think it can possibly delete them all between the time the command is issued and ctrl-c. Does this mean it leaves beh

RE: Data loss in Hadoop 0.19.1

2009-03-11 Thread Koji Noguchi
FYI: We temporarily lost a couple of blocks in 0.18.3 due to https://issues.apache.org/jira/browse/HADOOP-5465 A fix should be coming soon (to 0.19 as well). Koji -Original Message- From: Nathan Marz [mailto:nat...@rapleaf.com] Sent: Wednesday, March 11, 2009 3:37 PM To: core-user@hadoo

Data loss in Hadoop 0.19.1

2009-03-11 Thread Nathan Marz
Are there any known data loss problems remaining in Hadoop 0.19.1? Thanks, Nathan Marz

Warning when using 2.6.27 (Was: Datanode goes missing, results in Too many open files in DFSClient)

2009-03-11 Thread Jean-Daniel Cryans
I found the solution here : http://pero.blogs.aprilmayjune.org/2009/01/22/hadoop-and-linux-kernel-2627-epoll-limits/ J-D On Fri, Mar 6, 2009 at 6:08 PM, Jean-Daniel Cryans wrote: > I know this one may be weird, but I'll give it a try. Thanks to anyone > reading this through. > > Setup : hadoop-0

where should Cloudera host their next public training session?

2009-03-11 Thread Christophe Bisciglia
Hey there, we're trying to decide where to host our next public training session, so I'd like to simply ask - where is it needed? Use this form or just drop me a note: http://spreadsheets.google.com/viewform?formkey=cHZfNzNoLUlkU0dJY0VhUVUwVlpnUUE6MA We'll do this over two days, with one day being

RE: streaming error when submit the job:Cannot run program "chmod": java.io.IOException: error=12, Cannot allocate memory

2009-03-11 Thread Koji Noguchi
Shixing, Discussion on https://issues.apache.org/jira/browse/HADOOP-5059 may be related. Koji -Original Message- From: shixing [mailto:paradise...@gmail.com] Sent: Wednesday, March 11, 2009 1:31 AM To: core-user@hadoop.apache.org Subject: streaming error when submit the job:Cannot r

Re: Why is large number of [(heavy) keys , (light) value] faster than (light)key , (heavy) value

2009-03-11 Thread Gyanit
Here are exact numbers: # of (k,v) pairs = 1.2 Mil; this is the same in both cases. # of unique k = 1000; k is an integer. # of unique v = 1 Mil; v is a big string. For a given k, the cumulative size of all v's associated with it is about 30 Mb. (That is, each v is about 25~30Kb.) # of Mappers = 30 # of Reducers = 10 (v,k)

Re: Why is large number of [(heavy) keys , (light) value] faster than (light)key , (heavy) value

2009-03-11 Thread Tim Wintle
On Tue, 2009-03-10 at 19:44 -0700, Gyanit wrote: > I have large number of key,value pairs. I don't actually care if data goes in > value or key. Let me be more exact. > (k,v) pair after combiner is about 1 mil. I have approx 1kb data for each > pair. I can put it in keys or values. > I have experi

Re: Error while putting data onto hdfs

2009-03-11 Thread Raghu Angadi
Raghu Angadi wrote: Amandeep Khurana wrote: My dfs.datanode.socket.write.timeout is set to 0. This had to be done to get Hbase to work. ah.. I see, we should fix that. Not sure how others haven't seen it till now. Affects only those with write.timeout set to 0 on the clients. filed : https

RE: Persistent HDFS On EC2

2009-03-11 Thread Malcolm Matalka
Haha, good to know I might be a guinea pig! -Original Message- From: Kris Jirapinyo [mailto:kris.jirapi...@biz360.com] Sent: Wednesday, March 11, 2009 15:59 To: core-user@hadoop.apache.org Subject: Re: Persistent HDFS On EC2 That was also the starting point for my experiment (Tom White's

Re: Why is large number of [(heavy) keys , (light) value] faster than (light)key , (heavy) value

2009-03-11 Thread Scott Carey
Well if the smaller keys are producing fewer unique values, there should be some more significant differences. I had assumed that your test produced the same number of unique values. I'm still not sure why there would be that significant of a difference as long as the total number of unique val

Re: Persistent HDFS On EC2

2009-03-11 Thread Kris Jirapinyo
That was also the starting point for my experiment (Tom White's article). Note that the most painful part about this setup is probably writing and testing the scripts that will enable this to happen (and also customizing your EC2 images). It would be interesting to see someone else try it. On We

Re: Why is large number of [(heavy) keys , (light) value] faster than (light)key , (heavy) value

2009-03-11 Thread Gyanit
I noticed one more thing. Lighter keys tend to produce a smaller number of unique keys. For example, there may be 10Mil (key,value) pairs, but if the key is lighter there might be just 1000 unique keys. In the other case, if keys are heavier, there might be 5 mil unique keys. I think this might have something to do with it. Botto

Re: Why is large number of [(heavy) keys , (light) value] faster than (light)key , (heavy) value

2009-03-11 Thread Scott Carey
That is a fascinating question. I would also love to know the reason behind this. If I were to guess I would have thought that smaller keys and heavier values would slightly outperform, rather than significantly underperform. (assuming total pair count at each phase is the same). Perhaps th

Re: Error while putting data onto hdfs

2009-03-11 Thread Raghu Angadi
Amandeep Khurana wrote: What happens if you set it to 0? How is it a workaround? HBase needs it in pre-19.0 (related story : http://www.nabble.com/Datanode-Xceivers-td21372227.html). It should not matter if you move to 0.19.0 or newer. And how would it matter if I change it to a large valu

Re: Not a host:port pair when running balancer

2009-03-11 Thread Doug Cutting
Konstantin Shvachko wrote: The port was not specified at all in the original configuration. Since 0.18, the port is optional. If no port is specified, then 8020 is used. 8020 is the default port for namenodes. https://issues.apache.org/jira/browse/HADOOP-3317 Doug

Re: HDFS is corrupt, need to salvage the data.

2009-03-11 Thread Raghu Angadi
Mayuran, It takes a very long time and a lot of iterations if we have to go through each debugging step one at a time. Maybe a jira is a good place. - Run fsck with the blocks option. - Check if those ids match the ids in the file names found by 'find'. - Check which directory these files are in.. and v

Re: Persistent HDFS On EC2

2009-03-11 Thread Adam Rose
Tom White wrote a great blog post about some options here: http://www.lexemetech.com/2008/08/elastic-hadoop-clusters-with-amazons.html plus an Amazon article: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873&categoryID=112 Regards, - Adam Kris Jirapinyo wrote: Why w

Re: Error while putting data onto hdfs

2009-03-11 Thread Amandeep Khurana
What happens if you set it to 0? How is it a workaround? And how would it matter if I change it to a large value? Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Wed, Mar 11, 2009 at 12:00 PM, Raghu Angadi wrote: > Amandeep Khurana wrote: > >> My dfs.

Re: Error while putting data onto hdfs

2009-03-11 Thread Raghu Angadi
Amandeep Khurana wrote: My dfs.datanode.socket.write.timeout is set to 0. This had to be done to get Hbase to work. ah.. I see, we should fix that. Not sure how others haven't seen it till now. Affects only those with write.timeout set to 0 on the clients. Since setting it to 0 itself is a w

Re: Not a host:port pair when running balancer

2009-03-11 Thread Konstantin Shvachko
This is not about the default port. The port was not specified at all in the original configuration. --Konstantin Doug Cutting wrote: Konstantin Shvachko wrote: Clarifying: port # is missing in your configuration, should be fs.default.name hdfs://hvcwydev0601:8020 where 8020 is your por

Re: HDFS is corrupt, need to salvage the data.

2009-03-11 Thread Mayuran Yogarajah
Mayuran Yogarajah wrote: Raghu Angadi wrote: The block files usually don't disappear easily. Check on the datanode if you find any files starting with "blk". Also check the datanode log to see what happened there... maybe you started on a different directory or something like that. Raghu.

Can we find a task-attemp status internally while processing the job?

2009-03-11 Thread Yair Even-Zohar
I have the following class definition: public class Ase2DbMapRed extends MapReduceBase implements TableMap, Tool { I am also implementing the close() method inherited from MapReduceBase. Is it possible to know (and how?) within the "public void close()..." method whether this parti
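One hedged sketch of a common old-API pattern: stash the JobConf in configure() so close() can at least read the task attempt id. The class name here is hypothetical, and note that when close() runs the framework has not yet judged the attempt, so success/failure is not knowable from inside the task:

```java
import java.io.IOException;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class TaskAwareMapRed extends MapReduceBase {
  private JobConf job;  // stashed so close() can read task-level properties

  @Override
  public void configure(JobConf job) {
    this.job = job;
  }

  @Override
  public void close() throws IOException {
    // e.g. attempt_200903112233_0001_m_000000_0 -- identifies this attempt.
    String attemptId = job.get("mapred.task.id");
    System.err.println("closing attempt " + attemptId);
  }
}
```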

Re: Error while putting data onto hdfs

2009-03-11 Thread Amandeep Khurana
My dfs.datanode.socket.write.timeout is set to 0. This had to be done to get Hbase to work. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Wed, Mar 11, 2009 at 10:23 AM, Raghu Angadi wrote: > > Did you change dfs.datanode.socket.write.timeout to 5 se

Re: Not a host:port pair when running balancer

2009-03-11 Thread Raghu Angadi
Doug Cutting wrote: Konstantin Shvachko wrote: Clarifying: port # is missing in your configuration, should be fs.default.name hdfs://hvcwydev0601:8020 where 8020 is your port number. That's the work-around, but it's a bug. One should not need to specify the default port number (8020).

Re: Not a host:port pair when running balancer

2009-03-11 Thread Doug Cutting
Konstantin Shvachko wrote: Clarifying: port # is missing in your configuration, should be fs.default.name hdfs://hvcwydev0601:8020 where 8020 is your port number. That's the work-around, but it's a bug. One should not need to specify the default port number (8020). Please file an issu

Re: Not a host:port pair when running balancer

2009-03-11 Thread Konstantin Shvachko
Clarifying: port # is missing in your configuration, should be fs.default.name hdfs://hvcwydev0601:8020 where 8020 is your port number. --Konstantin Hairong Kuang wrote: Please try using the port number 8020. Hairong On 3/11/09 9:42 AM, "Stuart White" wrote: I've been running hado

Re: Error while putting data onto hdfs

2009-03-11 Thread Raghu Angadi
Did you change dfs.datanode.socket.write.timeout to 5 seconds? The exception message says so. It is extremely small. The default is 8 minutes and is intentionally pretty high. Its purpose is mainly to catch extremely unresponsive datanodes and other network issues. Raghu. Amandeep Khurana
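A sketch of the property in question, assuming a 0.19-era Configuration; the values mirror the defaults Raghu describes rather than suggested settings:

```java
import org.apache.hadoop.conf.Configuration;

public class WriteTimeoutSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Default is 8 minutes, expressed in milliseconds.
    conf.setLong("dfs.datanode.socket.write.timeout", 8 * 60 * 1000);
    // 0 disables the timeout entirely -- the pre-0.19 HBase work-around
    // discussed in this thread, not a general tuning knob.
    // conf.setLong("dfs.datanode.socket.write.timeout", 0);
    System.out.println(conf.get("dfs.datanode.socket.write.timeout"));
  }
}
```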

tuning performance

2009-03-11 Thread Vadim Zaliva
Hi! I have a question about fine-tuning hadoop performance on 8-core machines. I have 2 machines I am testing. One is an 8-core Xeon and the other is an 8-core Opteron, with 16Gb RAM each. They both run mapreduce and dfs nodes. Currently I've set up each of them to run 32 map and 8 reduce tasks. Also, HADOOP
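A sketch of the usual 0.19-era slot-sizing knobs (these are tasktracker-side settings normally placed in hadoop-site.xml; the values are illustrative, not a tuning recommendation for these machines):

```java
import org.apache.hadoop.conf.Configuration;

public class SlotSizingSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Concurrent task slots per tasktracker. A common rule of thumb is
    // 1-2x the core count, so 32 maps on an 8-core box oversubscribes it.
    conf.setInt("mapred.tasktracker.map.tasks.maximum", 8);
    conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 4);
    // Per-task child JVM heap: 32 concurrent tasks x heap must fit in 16 GB.
    conf.set("mapred.child.java.opts", "-Xmx512m");
  }
}
```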

Re: Not a host:port pair when running balancer

2009-03-11 Thread Hairong Kuang
Please try using the port number 8020. Hairong On 3/11/09 9:42 AM, "Stuart White" wrote: > I've been running hadoop-0.19.0 for several weeks successfully. > > Today, for the first time, I tried to run the balancer, and I'm receiving: > > java.lang.RuntimeException: Not a host:port pair: hvcwy

Not a host:port pair when running balancer

2009-03-11 Thread Stuart White
I've been running hadoop-0.19.0 for several weeks successfully. Today, for the first time, I tried to run the balancer, and I'm receiving: java.lang.RuntimeException: Not a host:port pair: hvcwydev0601 In my hadoop-site.xml, I have this: fs.default.name hdfs://hvcwydev0601/ What do I nee
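A sketch of the work-around suggested downthread: name the default namenode port (8020) explicitly. The same value would normally go in hadoop-site.xml; it is shown here via the Configuration API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class DefaultFsPortWorkaround {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Work-around from this thread: spell out the default namenode port
    // (8020) explicitly instead of relying on "hdfs://hvcwydev0601/".
    conf.set("fs.default.name", "hdfs://hvcwydev0601:8020");
    FileSystem fs = FileSystem.get(conf);
    System.out.println(fs.getUri());  // hdfs://hvcwydev0601:8020
  }
}
```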

Re: wordcount getting slower with more mappers and reducers?

2009-03-11 Thread Jim Twensky
Sandy, Correct me if I'm wrong, but if you have only two cores and you are running your jobs in pseudo-distributed mode, what is the point of having more than 2 mappers/reducers? Any number larger than 2 would make the mapper/reducer threads serialize. That serialization would certainly be an over

Re: Extending ClusterMapReduceTestCase

2009-03-11 Thread jason hadoop
Finally remembered, we had saxon 6.5.5 in the class path, and the jetty error was 09/03/11 08:23:20 WARN xml.XmlParser: EXCEPTION javax.xml.parsers.ParserConfigurationException: AElfred parser is non-validating On Wed, Mar 11, 2009 at 8:01 AM, jason hadoop wrote: > I am having trouble reproducing

Re: Extending ClusterMapReduceTestCase

2009-03-11 Thread jason hadoop
I am having trouble reproducing this one. It happened in a very specific environment that pulled in an alternate sax parser. The bottom line is that jetty expects a parser with particular capabilities and if it doesn't get one, odd things happen. In a day or so I will have hopefully worked out th

Re: Persistent HDFS On EC2

2009-03-11 Thread Kris Jirapinyo
Why would you lose the locality of storage-per-machine if one EBS volume is mounted to each machine instance? When that machine goes down, you can just restart the instance and re-mount the exact same volume. I've tried this idea before successfully on a 10 node cluster on EC2, and didn't see any

RE: Persistent HDFS On EC2

2009-03-11 Thread Malcolm Matalka
I am estimating that all of the data I will need to run the job will be ~2 terabytes. Is that too large a data set to be copying from S3 every startup? -Original Message- From: Steve Loughran [mailto:ste...@apache.org] Sent: Wednesday, March 11, 2009 9:39 To: core-user@hadoop.apache.org

Re: Persistent HDFS On EC2

2009-03-11 Thread Steve Loughran
Malcolm Matalka wrote: If this is not the correct place to ask Hadoop + EC2 questions please let me know. I am trying to get a handle on how to use Hadoop on EC2 before committing any money to it. My question is, how do I maintain a persistent HDFS between restarts of instances. Most of th

Persistent HDFS On EC2

2009-03-11 Thread Malcolm Matalka
If this is not the correct place to ask Hadoop + EC2 questions please let me know. I am trying to get a handle on how to use Hadoop on EC2 before committing any money to it. My question is, how do I maintain a persistent HDFS between restarts of instances. Most of the tutorials I have found i

using virtual slave machines

2009-03-11 Thread arulP
In Hadoop cluster management, I'm trying to replace the physical machine slaves with virtual machine slaves. Is there any change in procedure? Also, the pseudo-distributed setup was successful on all virtual machines. -- View this message in context: http://www.nabble.com/using-virtual-slave-mac

Re: Extending ClusterMapReduceTestCase

2009-03-11 Thread Steve Loughran
jason hadoop wrote: The other goofy thing is that the xml parser that is commonly first in the class path validates xml in a way that is opposite to what jetty wants. What does ant -diagnostics say? It will list the XML parser at work. This line in the preamble before the ClusterMapReduceTes

streaming error when submit the job:Cannot run program "chmod": java.io.IOException: error=12, Cannot allocate memory

2009-03-11 Thread shixing
09/03/11 15:43:55 ERROR streaming.StreamJob: Error Launching job : java.io.IOException: Cannot run program "chmod": java.io.IOException: error=12, Cannot allocate memory at java.lang.ProcessBuilder.start(ProcessBuilder.java:459) at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)

Re: streaming inputformat: class not found

2009-03-11 Thread Amareshwari Sriramadasu
Till 0.18.x, files are not added to the client-side classpath. Use 0.19, and run the following command to use a custom input format: bin/hadoop jar contrib/streaming/hadoop-0.19.0-streaming.jar -mapper mapper.pl -reducer org.apache.hadoop.mapred.lib.IdentityReducer -input test.data -output test-output -fi