Re: wordcount getting slower with more mappers and reducers?

2009-03-05 Thread Amareshwari Sriramadasu
Are you hitting HADOOP-2771? -Amareshwari Sandy wrote: Hello all, For the sake of benchmarking, I ran the standard hadoop wordcount example on an input file using 2, 4, and 8 mappers and reducers for my job. In other words, I do: time -p bin/hadoop jar hadoop-0.18.3-examples.jar wordcount -m

Re: contrib EC2 with hadoop 0.17

2009-03-05 Thread Tom White
I haven't used Eucalyptus, but you could start by trying out the Hadoop EC2 scripts (http://wiki.apache.org/hadoop/AmazonEC2) with your Eucalyptus installation. Cheers, Tom On Tue, Mar 3, 2009 at 2:51 PM, falcon164 mujahid...@gmail.com wrote: I am new to Hadoop. I want to run Hadoop on

Re: Hadoop AMI for EC2

2009-03-05 Thread Tom White
Hi Richa, Yes there is. Please see http://wiki.apache.org/hadoop/AmazonEC2. Tom On Thu, Mar 5, 2009 at 4:13 PM, Richa Khandelwal richa...@gmail.com wrote: Hi All, Is there an existing Hadoop AMI for EC2 which has Hadoop set up on it? Thanks, Richa Khandelwal University Of California,

Re: Hadoop AMI for EC2

2009-03-05 Thread tim robertson
Yeps, A good starting read: http://wiki.apache.org/hadoop/AmazonEC2 These are the AMIs: $ ec2-describe-images -a | grep hadoop IMAGE ami-245db94d cloudbase-1.1-hadoop-fc64/image.manifest.xml 247610401714 available public x86_64 machine IMAGE ami-791ffb10

Re: Hadoop AMI for EC2

2009-03-05 Thread Richa Khandelwal
Hi All, I am trying to log map reduce jobs in HADOOP_LOG_DIR by setting its value in hadoop-env.sh. But the directory has no log records when the job finishes running. I am adding JobConf.setProfileEnabled(true) in my job. Can anyone point out how to enable logging in Hadoop? Thanks, Richa On Thu,
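For reference, a minimal sketch of the profiling switch mentioned above, under the old org.apache.hadoop.mapred API (the job class name is hypothetical). Note that setProfileEnabled(true) controls per-task profiling output, which is separate from the daemon logs that HADOOP_LOG_DIR governs:

    import org.apache.hadoop.mapred.JobConf;

    public class ProfilingSetup {
      public static void main(String[] args) {
        JobConf conf = new JobConf(ProfilingSetup.class);
        // Task attempts will write profiling output alongside their task logs.
        conf.setProfileEnabled(true);
      }
    }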

Re: wordcount getting slower with more mappers and reducers?

2009-03-05 Thread Sandy
I used three different sample.txt files, and was able to replicate the error. The first was 1.5MB, the second 66MB, and the last 428MB. I get the same problem no matter what size of input file I use: the running time of wordcount increases with the number of mappers and reducers specified. If it is

Re: System Layout Best Practices

2009-03-05 Thread Sandy
Hi David, I don't know if you've seen this already, but this might be of some help: http://hadoop.apache.org/core/docs/r0.18.3/cluster_setup.html Near the bottom, there is a section called Real-World Cluster Configurations with some sample configuration parameters that were used to run a very

Re: wordcount getting slower with more mappers and reducers?

2009-03-05 Thread Matt Ingenthron
Sandy wrote: I used three different sample.txt files, and was able to replicate the error. The first was 1.5MB, the second 66MB, and the last 428MB. I get the same problem no matter what size of input file I use: the running time of wordcount increases with the number of mappers and reducers

Re: wordcount getting slower with more mappers and reducers?

2009-03-05 Thread Sandy
I specified a directory containing my 428MB file split into 8 files. Same results. I should summarize my hadoop-site.xml file: mapred.tasktracker.tasks.maximum = 4, mapred.line.input.format.linespermap = 1, mapred.task.timeout = 0, mapred.min.split.size = 1, mapred.child.java.opts = -Xmx2M

Recommend JSON Library? net.sf.json has memory leak

2009-03-05 Thread Kevin Peterson
We're using JSON serialization for all our data, but we can't seem to find a good library. We just discovered that the root cause of our out-of-memory errors is a leak in the net.sf.json library. Can anyone out there recommend a Java JSON library that they have actually used successfully within

Re: Recommend JSON Library? net.sf.json has memory leak

2009-03-05 Thread Ian Swett
We've used Jackson (http://jackson.codehaus.org/), which we've found to be easy to use and faster than any other option. We've also had problems with net.sf in terms of memory and performance. You can see a performance comparison here:
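For anyone evaluating it, here is a minimal data-binding sketch, assuming a Jackson 1.x-era API (the Person class and sample JSON are made up):

    import java.io.StringWriter;
    import org.codehaus.jackson.map.ObjectMapper;

    public class JacksonDemo {
      // Hypothetical value class; Jackson binds it via the getter/setter convention.
      public static class Person {
        private String name;
        private int age;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
      }

      public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Bind JSON text to a POJO...
        Person p = mapper.readValue("{\"name\":\"Ada\",\"age\":36}", Person.class);
        // ...and write it back out.
        StringWriter out = new StringWriter();
        mapper.writeValue(out, p);
        System.out.println(out);
      }
    }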

Re: Recommend JSON Library? net.sf.json has memory leak

2009-03-05 Thread Ken Weiner
I had discovered a memory leak in net.sf.json as well. I filed an issue and it got fixed in the latest release: http://sourceforge.net/tracker/?func=detail&atid=857928&aid=2063201&group_id=171425 Have you tried the latest version 2.2.3? On Thu, Mar 5, 2009 at 9:48 AM, Kevin Peterson

DataNode stops cleaning disk?

2009-03-05 Thread Igor Bolotin
Normally I dislike writing about problems without being able to provide some more information, but unfortunately in this case I just can't find anything. Here is the situation - DFS cluster running Hadoop version 0.19.0. The cluster is running on multiple servers with practically identical

Re: wordcount getting slower with more mappers and reducers?

2009-03-05 Thread Arun C Murthy
I assume you have only 2 map and 2 reduce slots per tasktracker - which totals to 2 maps/reduces for your cluster. This means that with more maps/reduces, they are serialized to 2 at a time. Also, the -m is only a hint to the JobTracker; you might see fewer/more than the number of maps you have
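To make the slot/hint distinction concrete, a small sketch under the old mapred API (class name hypothetical): the map count is only requested, while the reduce count is honored, though execution is still throttled by the cluster's slots.

    import org.apache.hadoop.mapred.JobConf;

    public class TaskCountSetup {
      public static void main(String[] args) {
        JobConf conf = new JobConf(TaskCountSetup.class);
        conf.setNumMapTasks(8);    // only a hint; the InputFormat's splits set the real count
        conf.setNumReduceTasks(8); // honored, but only as many run at once as there are slots
      }
    }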

Re: Recommend JSON Library? net.sf.json has memory leak

2009-03-05 Thread Doug Cutting
Ian Swett wrote: We've used Jackson (http://jackson.codehaus.org/), which we've found to be easy to use and faster than any other option. I also use Jackson and recommend it. Doug

Re: DataNode stops cleaning disk?

2009-03-05 Thread Raghu Angadi
This is unexpected unless some other process is eating up space. A couple of things to collect next time (along with the log): - All the contents under datanode-directory/ (especially including 'tmp' and 'current') - Does 'du' of this directory match what is reported to the NameNode (shown on

RE: DataNode stops cleaning disk?

2009-03-05 Thread Igor Bolotin
That's what I saw just yesterday on one of the data nodes with this situation (will also confirm next time it happens): - tmp and current were either empty or almost empty last time I checked. - du on the entire data directory matched exactly the used space reported in the NameNode web UI, and it did

Batch processing map reduce jobs

2009-03-05 Thread Richa Khandelwal
Hi All, Does anyone know how to run MapReduce jobs using Pipes, or how to batch-process MapReduce jobs? Thanks, Richa Khandelwal University Of California, Santa Cruz. Ph:425-241-7763

Re: wordcount getting slower with more mappers and reducers?

2009-03-05 Thread Sandy
I was trying to control the maximum number of tasks per tasktracker by using the mapred.tasktracker.tasks.maximum parameter. I am interpreting your comment to mean that maybe this parameter is malformed and should read: mapred.tasktracker.map.tasks.maximum = 8 mapred.tasktracker.map.tasks.maximum
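One way to check which property name actually took effect is to read it back from the loaded configuration; a minimal sketch (note the tasktracker daemon reads these values at startup from its own hadoop-site.xml, so this only verifies the spelling in the local config file):

    import org.apache.hadoop.mapred.JobConf;

    public class SlotConfigCheck {
      public static void main(String[] args) {
        JobConf conf = new JobConf(); // loads hadoop-default.xml and hadoop-site.xml
        // Prints "unset" if the property name in hadoop-site.xml is misspelled.
        System.out.println("map slots:    " + conf.get("mapred.tasktracker.map.tasks.maximum", "unset"));
        System.out.println("reduce slots: " + conf.get("mapred.tasktracker.reduce.tasks.maximum", "unset"));
      }
    }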

Re: wordcount getting slower with more mappers and reducers?

2009-03-05 Thread haizhou zhao
As I mentioned above, you should at least try it like this: map2 reduce1, map4 reduce1, map8 reduce1, map4 reduce1, map4 reduce2, map4 reduce4, instead of: map2 reduce2, map4 reduce4, map8 reduce8. 2009/3/6 Sandy snickerdoodl...@gmail.com I was trying to control the maximum number of tasks per

Re: Throwing an IOException in Map, yet task does not fail

2009-03-05 Thread Jothi Padmanabhan
Just trying to understand this better: are you observing that the task that failed with the IOException is not getting marked as killed? If yes, that does not look right... Jothi On 3/6/09 8:12 AM, Saptarshi Guha saptarshi.g...@gmail.com wrote: Hello, I have given a case where my mapper
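For context, a minimal sketch of the pattern under discussion (old mapred API; the mapper and its failure condition are made up): an IOException thrown from map() should fail the task attempt, and repeated failures should fail the job.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class FailingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        if (value.toString().length() == 0) { // made-up failure condition
          throw new IOException("bad record at offset " + key); // should mark the attempt FAILED
        }
        output.collect(value, new LongWritable(1));
      }
    }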

Re: Throwing an IOException in Map, yet task does not fail

2009-03-05 Thread Amareshwari Sriramadasu
Is your job a streaming job? If so, which version of Hadoop are you using, and what is the configured value of stream.non.zero.exit.is.failure? Can you set stream.non.zero.exit.is.failure to true and try again? Thanks Amareshwari Saptarshi Guha wrote: Hello, I have given a case where my mapper

Re: Running 0.19.2 branch in production before release

2009-03-05 Thread Aaron Kimball
Right, there's no sense in freezing your Hadoop version forever :) But if you're an ops team tasked with keeping a production cluster running 24/7, running on 0.19 (or even more daringly, TRUNK) is not something that I would consider a Best Practice. Ideally you'll be able to carve out some spare

Re: Repartitioned Joins

2009-03-05 Thread Aaron Kimball
Richa, Since the mappers run independently, you'd have a hard time determining whether a record in mapper A should be joined with a record in mapper B. The solution, as it were, would be to do this in two separate MapReduce passes (see the sketch below): * Take an educated guess at which table is the smaller data set. *
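The usual building block for a repartitioned (reduce-side) join is to tag each map output value with its source table, so the reducer can pair rows that share a key. A hedged sketch of that general idea, not necessarily Aaron's exact two-pass plan (the table-name matching and field layout are made up):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Tags each record with its source table so the reducer can tell A-rows
    // from B-rows that share a join key. The tag is derived from the input path.
    public class TaggingJoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      private String tag;

      public void configure(JobConf job) {
        // "map.input.file" holds the current split's file path in the old API.
        tag = job.get("map.input.file", "").contains("tableA") ? "A" : "B";
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        String[] fields = value.toString().split("\t", 2); // assume the join key is field 0
        output.collect(new Text(fields[0]), new Text(tag + ":" + value));
      }
    }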

Re: Throw an exception if the configure method fails

2009-03-05 Thread Aaron Kimball
Try throwing RuntimeException, or any other unchecked exception (e.g., any descendant class of RuntimeException). - Aaron On Thu, Mar 5, 2009 at 4:24 PM, Saptarshi Guha saptarshi.g...@gmail.com wrote: Hello, I'm not that comfortable with Java, so here is my question. In the MapReduceBase
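A minimal sketch of the suggestion (old mapred API; the required-setting check is hypothetical): configure() declares no checked exceptions, so a fatal setup problem has to surface as an unchecked one.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;

    public class StrictTaskBase extends MapReduceBase {
      @Override
      public void configure(JobConf job) {
        // Unchecked exceptions propagate out of configure() and fail the task.
        if (job.get("my.required.setting") == null) { // hypothetical required key
          throw new RuntimeException("my.required.setting is not configured");
        }
      }
    }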

Re: The cpu preemption between MPI and Hadoop programs on Same Cluster

2009-03-05 Thread Aaron Kimball
Song, you should be able to use 'nice' to reprioritize the MPI task below your Hadoop jobs. - Aaron On Thu, Mar 5, 2009 at 8:26 PM, 柳松 lamfeel...@126.com wrote: Dear all: I run my hadoop program with another MPI program on the same cluster. Here is the result of top. PID USER

Fetch errors. 2 node cluster.

2009-03-05 Thread pavelkolodin
Hello to all. I have 2 nodes in the cluster - master + slave. The names master1 and slave1 are stored in /etc/hosts on both hosts and they are 100% correct. conf/masters: master1 conf/slaves: master1 slave1 conf/slaves + conf/masters are empty on the slave1 node. I tried to fill them in many ways - it

Re: Reduce doesn't start until map finishes

2009-03-05 Thread Rasit OZDAS
So, is there currently no solution to my problem? Should I live with it? Or do we need a JIRA for this? What do you think? 2009/3/4 Nick Cen cenyo...@gmail.com Thanks. About the Secondary Sort, can you provide an example? What do the intermediate keys stand for? Assume I have
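On the secondary-sort question quoted above, a hedged sketch of the standard pattern (old mapred API; the '#'-delimited composite-key encoding is made up): the intermediate key combines the natural key with the field the values should be sorted by; the partitioner and the value-grouping comparator look only at the natural part, so each reduce() call sees one natural key with its values arriving pre-sorted.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class SecondarySortSupport {
      // Route by the natural key only, so all "natural#*" keys reach one reducer.
      public static class NaturalKeyPartitioner implements Partitioner<Text, Text> {
        public void configure(JobConf job) {}
        public int getPartition(Text key, Text value, int numPartitions) {
          String natural = key.toString().split("#", 2)[0];
          return (natural.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
      }

      // Group values for reduce() by the natural key only; the full composite
      // key still drives the sort, which is what orders the values.
      public static class NaturalKeyGroupingComparator extends WritableComparator {
        public NaturalKeyGroupingComparator() { super(Text.class, true); }
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
          return a.toString().split("#", 2)[0].compareTo(b.toString().split("#", 2)[0]);
        }
      }

      public static void configureJob(JobConf conf) {
        conf.setPartitionerClass(NaturalKeyPartitioner.class);
        conf.setOutputValueGroupingComparator(NaturalKeyGroupingComparator.class);
      }
    }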