Re: Hadoop streaming performance problem

2008-03-31 Thread Amareshwari Sriramadasu
LineRecordReader.readLine() is deprecated by HADOOP-2285 (http://issues.apache.org/jira/browse/HADOOP-2285) because it was slow, but streaming still uses the method. HADOOP-2826 (http://issues.apache.org/jira/browse/HADOOP-2826) will remove the usage in streaming. This change should improve str

Re: Run job not from namenode

2008-03-31 Thread Amar Kamat
Andrey Pankov wrote: Hi all, Currently I'm able to run map-reduce jobs from the box where the NameNode and JobTracker are running. But I'd like to run my jobs from a separate box, from which I have access to HDFS. I have updated the params fs.default.name and mapred.job.tracker in the local hadoop dir to poin

Re: Hadoop streaming performance problem

2008-03-31 Thread Andreas Kostyrka
I beg your pardon, Python is a fast language. Simple operations are usually considerably more expensive than in lower-level languages, but at least when it is used by somebody who has enough experience, that doesn't matter too much. Actually, in many practical cases, because of project deadlines, C++ (and

Re: Hadoop streaming performance problem

2008-03-31 Thread Theodore Van Rooy
I agree... as I said in one of the earlier emails, I saw a 50% speedup in a Perl script which categorizes O(10^9) rows at a time. Also I wrote a very simple Python script (something like a 'cat'), and saw a similar speedup. These tests were with 1 GB files. We were testing this here at DoubleClick

Re: Hadoop streaming performance problem

2008-03-31 Thread Ted Dunning
My experiences with Groovy are similar. Noticeable slowdown, but quite bearable (almost always better than 50% of best attainable speed). The highest virtue is that simple programs become simple again. Word count is < 5 lines of code. On 3/31/08 6:10 PM, "Colin Evans" <[EMAIL PROTECTED]> w

Re: reduce task hanging or just slow?

2008-03-31 Thread Colin Freas
I believe that this is exactly what happened. I'm not sure exactly what happened, but the networking stack on the master node was all screwed up somehow. All the machines serve double duty as development boxes, and they're on two different networks. The master node could contact the cluster netw

Re: Hadoop streaming performance problem

2008-03-31 Thread Colin Evans
At Metaweb, we did a lot of comparisons between streaming (using Python) and native Java, and in general streaming performance was not much slower than native Java -- most of the slowdown was from Python being a slow language. The main problems with streaming apps that we found are that th

Re: reduce task hanging or just slow?

2008-03-31 Thread Mafish Liu
Hi, I ran into a similar problem. In the end I found that it was caused by hostname resolution, because Hadoop uses hostnames to access other nodes. To fix this, try opening your jobtracker log file (it often resides in $HADOOP_HOME/logs/hadoop--jobtracker-.log ) t
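A quick way to check the hostname-resolution theory from the message above is to try resolving every node name on each machine. The following is only a minimal sketch: the function name unresolved_hosts and the injectable resolve parameter are made up for illustration, and the list of hostnames is assumed to come from your own cluster configuration.

```python
import socket


def unresolved_hosts(hostnames, resolve=socket.gethostbyname):
    """Return (hostname, error message) pairs for names that fail to resolve.

    `resolve` is injectable so the check can be exercised without a network;
    by default it is the standard library's socket.gethostbyname.
    """
    failures = []
    for name in hostnames:
        try:
            resolve(name)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            failures.append((name, str(exc)))
    return failures
```

Running this on every node with the same list of master and slave names should return an empty list everywhere; any name reported on only some machines points at an inconsistent hosts file or DNS setup.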

Re: Hadoop streaming performance problem

2008-03-31 Thread Andreas Kostyrka
Yes. I would have to look up the exact numbers in my Sent folder (I've been reporting all the time to my boss), but it has been a sizeable portion of the runtime (on an EC2+HDFS-on-S3 cluster), especially when one included the time to load the data into HDFS. Ah gone ahead and looked it up:

Re: Hadoop streaming performance problem

2008-03-31 Thread Andreas Kostyrka
Because many, many people do not enjoy that verbose language, you know. (Just replaced an old 754-line task with a ported one that takes 89 lines.) So as crazy as it might sound to some here, hadoop streaming is the primary interface for probably a sizeable part of the "user population". (Users be

Re: Hadoop streaming performance problem

2008-03-31 Thread Travis Brady
> > A set of reasonable performance tests and results would be very helpful > in helping people decide whether to go with streaming or not. Hopefully > we can get some numbers from this thread and publish them? Anyone else > compared streaming with native java? > I think this is a great idea. I t

Re: reduce task hanging or just slow?

2008-03-31 Thread Nathan Fiedler
I don't have any particular experience with this, but perhaps X-Trace [1] can help. The presentation given at the Hadoop Summit was very impressive, looks like a great debugging tool. There are hooks already in Hadoop, so I think it's just a matter of enabling them, collecting the data, and generat

Re: Hadoop streaming performance problem

2008-03-31 Thread Colin Freas
Really? I would expect the opposite: for compressed files to process slower. You're saying that is not the case, and that compressed files actually increase the speed of jobs? -Colin On Mon, Mar 31, 2008 at 4:51 PM, Andreas Kostyrka <[EMAIL PROTECTED]> wrote: > Well, on our EC2/HDFS-on-S3 clus

Re: Hadoop streaming performance problem

2008-03-31 Thread Parand Darugar
Travis Brady wrote: This brings up two interesting issues: 1. Hadoop streaming is a potentially very powerful tool, especially for those of us who don't work in Java for whatever reason 2. If Hadoop streaming is "at best a jury rigged solution" then that should be made known somewhere on the wik

Re: Hadoop streaming performance problem

2008-03-31 Thread Jeff Hammerbacher
We have 40 or so engineers who only use streaming at Facebook, so no matter how jury-rigged the solution might be, it's been immensely valuable for developer productivity. As we found with Thrift, letting developers write code in their language of choice has many benefits, including development sp

Re: Hadoop streaming performance problem

2008-03-31 Thread Ted Dunning
It is provided because of point (2). But that doesn't necessarily make it a good thing to do. The basic idea has real problems. Hive is likely to resolve many of these issues (when it becomes publicly available) but some are inherent with the basic idea of moving data across language barriers t

Re: Hadoop streaming performance problem

2008-03-31 Thread Travis Brady
This brings up two interesting issues: 1. Hadoop streaming is a potentially very powerful tool, especially for those of us who don't work in Java for whatever reason 2. If Hadoop streaming is "at best a jury rigged solution" then that should be made known somewhere on the wiki. If it's really not

Re: Hadoop streaming performance problem

2008-03-31 Thread Ted Dunning
This seems a bit surprising. In my experience well-written Java is generally just about as fast as C++, especially for I/O bound work. The exceptions are: - Java startup is still slow. This shouldn't matter much here because you are using streaming anyway so you have Java startup + C startup.

Re: Hadoop streaming performance problem

2008-03-31 Thread lin
Well, we would like to use hadoop streaming because our current system is in C++ and it is easier to migrate to hadoop streaming. Also we have very strict performance requirements. Java seems to be too slow. I rewrote the first program in Java and it runs 4 to 5 times slower than the C++ one. On M

Re: Hadoop streaming performance problem

2008-03-31 Thread Ted Dunning
Hadoop can't split a gzipped file so you will only get as many maps as you have files. Why the obsession with hadoop streaming? It is at best a jury rigged solution. On 3/31/08 3:12 PM, "lin" <[EMAIL PROTECTED]> wrote: > Does Hadoop automatically decompress the gzipped file? I only have a si
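Since a gzipped file cannot be split, one possible workaround is to split the input on line boundaries first and gzip each piece, so that every piece gets its own map task. The sketch below is an illustration only, not something proposed in the thread; the function name split_and_gzip and the part-NNNNN.gz naming are assumptions.

```python
import gzip
import os


def split_and_gzip(src_path, out_dir, num_parts):
    """Write src_path as at most num_parts gzip files, split on line boundaries."""
    os.makedirs(out_dir, exist_ok=True)
    total = os.path.getsize(src_path)
    # Ceiling division: approximate number of uncompressed bytes per part.
    target = max(1, -(-total // num_parts))
    paths = []
    out = None
    written = 0
    with open(src_path, "rb") as src:
        for line in src:
            # Start a new part once the current one is full, unless we have
            # already opened num_parts files (the last part takes the rest).
            if out is None or (written >= target and len(paths) < num_parts):
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"part-{len(paths):05d}.gz")
                out = gzip.open(path, "wb")
                paths.append(path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return paths
```

The resulting directory can then be uploaded and used as the job's input path. Whether the extra map tasks outweigh the cost of splitting depends on the cluster and the data.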

Re: Hadoop streaming performance problem

2008-03-31 Thread lin
My previous java.opts was actually "--server -Xmx512m". I increased the heap size to 1024M and the running time was about the same. The resident sizes of the java processes seem to be no greater than 150M. On Mon, Mar 31, 2008 at 1:56 PM, Theodore Van Rooy <[EMAIL PROTECTED]> wrote: > try extendi

Re: Hadoop streaming performance problem

2008-03-31 Thread lin
Does Hadoop automatically decompress the gzipped file? I only have a single input file. Does it have to be split and then gzipped? I gzipped the input file and Hadoop only created one map task. Still, Java is using more than 90% CPU. On Mon, Mar 31, 2008 at 1:51 PM, Andreas Kostyrka <[EMAIL PRO

Hadoop Port Configuration

2008-03-31 Thread Natarajan, Senthil
Hi, I am using the default settings from hadoop-default.xml and hadoop-site.xml, and I just changed this port number: mapred.task.tracker.report.address. I created a firewall rule to allow port range 5:50100 between the slaves and master. But reduce on the slaves using some other ports

Re: Hadoop streaming performance problem

2008-03-31 Thread Theodore Van Rooy
try extending the java heap size as well.. I'd be interested to see what kind of an effect that has on time (if any). On Mon, Mar 31, 2008 at 2:30 PM, lin <[EMAIL PROTECTED]> wrote: > I'm running custom map programs written in C++. What the programs do is > very > simple. For example, in progra

Re: Hadoop streaming performance problem

2008-03-31 Thread Andreas Kostyrka
Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to provide the input files gzipped. Not a great difference (e.g. 50% slower when not gzipped, plus it took more than twice as long to upload the data to HDFS-on-S3 in the first place), but still probably relevant. Andreas On Monday, the

Re: Hadoop streaming performance problem

2008-03-31 Thread lin
I'm running custom map programs written in C++. What the programs do is very simple. For example, in program 2, for each input line "ID node1 node2 ... nodeN" the program outputs "node1 ID", "node2 ID", ..., "nodeN ID". Each node has 4GB to 8GB of memory. The Java memory
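For readers following the thread, the transformation described for program 2 can be sketched as a streaming mapper. The original programs are in C++; this Python version is only an illustration, and the names invert_line and run are made up. It assumes whitespace-separated fields and writes each pair separated by a single space, as in the message; a real job might prefer a tab so that streaming treats the node as the key.

```python
import sys


def invert_line(line):
    """Turn 'ID node1 node2 ... nodeN' into ['node1 ID', ..., 'nodeN ID']."""
    fields = line.split()
    if not fields:
        return []  # blank input line: nothing to emit
    record_id, nodes = fields[0], fields[1:]
    return [f"{node} {record_id}" for node in nodes]


def run(stream_in, stream_out):
    """Read records from stream_in and write one 'node ID' pair per line."""
    for line in stream_in:
        for pair in invert_line(line):
            stream_out.write(pair + "\n")


# In an actual mapper script, finish with: run(sys.stdin, sys.stdout)
```

Saved as a script that ends with the run(sys.stdin, sys.stdout) call, this could be passed to streaming as the mapper, which makes it easy to compare against the C++ version on the same input.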

Hadoop input path - can it have subdirectories

2008-03-31 Thread Tarandeep Singh
Hi, Can I pass a directory having subdirectories (which further have subdirectories) to hadoop as input path? I tried it, but I got an error :( -Taran

Re: Hadoop streaming performance problem

2008-03-31 Thread Theodore Van Rooy
Are you running a custom map script or a standard Linux command like wc? If custom, what does your script do? How much RAM do you have? What are your Java memory settings? I used the following setup: 2 dual core, 16 GB RAM, 1000 MB Java heap size on an empty box with a 4 task max. I got the follo

Hadoop streaming performance problem

2008-03-31 Thread lin
Hi, I am looking into using Hadoop streaming to parallelize some simple programs. So far the performance has been pretty disappointing. The cluster contains 5 nodes. Each node has two CPU cores. The task capacity of each node is 2. The Hadoop version is 0.15. Program 1 runs for 3.5 minutes on th

reduce task hanging or just slow?

2008-03-31 Thread Colin Freas
I've set up a job to run on my small 4 (sometimes 5) node cluster on dual processor server boxes with 2-8GB of memory. My job processes 24 100-300MB files that are a day's worth of logs; total data is about 6GB. I've modified the word count example to do what I need, and it works fine on small tes

Re: DFS get blocked when writing a file.

2008-03-31 Thread Raghu Angadi
Iván, Whether this was expected or an error depends on what happened on the client. This could happen, and would not be a bug, if the client was killed for some other reason, for example. But if the client is also similarly surprised, then it's a different case. You could grep for this block in NameNode log

Re: S3 Support in 16.1

2008-03-31 Thread Jake Thompson
Sweet, did not see that in Jira/did not relate the errors. Now time to wait until 0.16.2 On Mon, Mar 31, 2008 at 7:33 AM, Tom White <[EMAIL PROTECTED]> wrote: > Hi Jake, > > Yes, this is a known issue that is fixed in 0.16.2 - see > https://issues.apache.org/jira/browse/HADOOP-3027. > > Tom > >

Run job not from namenode

2008-03-31 Thread Andrey Pankov
Hi all, Currently I'm able to run map-reduce jobs from the box where the NameNode and JobTracker are running. But I'd like to run my jobs from a separate box, from which I have access to HDFS. I have updated the params fs.default.name and mapred.job.tracker in the local hadoop dir to point to the cluster's mas

Re: S3 Support in 16.1

2008-03-31 Thread Tom White
Hi Jake, Yes, this is a known issue that is fixed in 0.16.2 - see https://issues.apache.org/jira/browse/HADOOP-3027. Tom On 31/03/2008, Jake Thompson <[EMAIL PROTECTED]> wrote: > So I am new to hadoop, but everything is working well so far. > Except. > I can use S3 fs in 15.3 without a pro

Re: DFS get blocked when writing a file.

2008-03-31 Thread Iván de Prado
Thanks, I have tried with the trunk version and now the exception "Trying to change block file offset of block blk_... to ... but actual size of file is ..." has disappeared and the jobs don't seem to get blocked. But I have other "Broken Pipe" and "EOF" exceptions in the dfs logs. They seem

Re: ipc.Client localhost problem

2008-03-31 Thread Peeyush Bishnoi
Hi, You have to set "garl-intel2" as the namenode in the fs.default.name parameter in hadoop-site.xml. Just follow this URL: http://hadoop.apache.org/core/docs/current/quickstart.html for installing Hadoop and configuring the Hadoop parameters. --- Peeyush On Mon, 2008-03-31 at 14

Re: ipc.Client localhost problem

2008-03-31 Thread Raghavendra K
Hi Peeyush, Thanks a lot for your reply. Can you tell me how to go about adding garl-intel2 to the hadoop-default.xml ? Is there any document available for doing that? Thanking you. On Mon, Mar 31, 2008 at 2:26 PM, Peeyush Bishnoi <[EMAIL PROTECTED]> wrote: > Hello Raghvendra , > > It is working

Re: ipc.Client localhost problem

2008-03-31 Thread Peeyush Bishnoi
Hello Raghvendra, It is working with default because the "default" parameter uses the configured file system from hadoop-site/hadoop-default.xml. As your localhost and garl-intel2 are not configured in hadoop-site/hadoop-default.xml, it is throwing the error. So with hdfsConnect you have to use H

ipc.Client localhost problem

2008-03-31 Thread Raghavendra K
Hi, When I connect using the code hdfsFS fs = hdfsConnect("default", 0); the program works successfully and I am also able to save the data. But when I connect using the code hdfsFS fs = hdfsConnect("localhost", 0); (or) hdfsFS fs = hdfsConnect("garl-intel2", 9000); I get the following error. 08/03/