LineRecordReader.readLine() is deprecated by
HADOOP-2285 (http://issues.apache.org/jira/browse/HADOOP-2285) because it was
slow.
But streaming still uses the method. HADOOP-2826
(http://issues.apache.org/jira/browse/HADOOP-2826) will remove the usage in
streaming.
This change should improve str
Andrey Pankov wrote:
Hi all,
Currently I'm able to run map-reduce jobs from the box where the NameNode and
JobTracker are running. But I'd like to run my jobs from a separate box,
from which I have access to HDFS. I have updated the params
fs.default.name and mapred.job.tracker in the local hadoop dir to poin
Beg your pardon, Python is a fast language, although simple operations
are usually quite a bit more expensive than in lower-level languages, but at
least when used by somebody who has enough experience, that doesn't
matter too much. Actually, in many practical cases, because of project
deadlines, C++ (and
I agree... as I said in one of the earlier emails, I saw a 50% speedup in a
perl script which categorizes O(10^9) rows at a time. Also I wrote a very
simple python script (something like a 'cat'), and saw similar speedup.
These tests were with 1 Gig files.
We were testing this here at DoubleClick
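For context, a minimal sketch (not the poster's actual script) of what such a
"cat"-like Python streaming mapper could look like, assuming it simply passes
input lines through unchanged:
    import sys
    # identity mapper: copy stdin to stdout line by line
    for line in sys.stdin:
        sys.stdout.write(line)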
My experiences with Groovy are similar. Noticeable slowdown, but quite
bearable (almost always better than 50% of best attainable speed).
The highest virtue is that simple programs become simple again. Word count
is < 5 lines of code.
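For illustration only, a minimal sketch of a streaming-style word-count mapper
in Python (the poster's < 5 line example was in Groovy; this just shows how
short such scripts tend to be; a reducer would then sum the 1s per word):
    import sys
    # emit "word<TAB>1" for every word read from stdin
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)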
On 3/31/08 6:10 PM, "Colin Evans" <[EMAIL PROTECTED]> w
I believe that this is exactly what happened.
I'm not sure exactly what happened, but the networking stack on the master
node was all screwed up somehow. All the machines serve double duty as
development boxes, and they're on two different networks. The master node
could contact the cluster netw
At Metaweb, we did a lot of comparisons between streaming (using Python)
and native Java, and in general streaming performance was not much
slower than the native java -- most of the slowdown was from Python
being a slow language.
The main problems with streaming apps that we found are that th
Hi:
I have run into a similar problem. Finally, I found that this
problem was caused by hostname resolution, because Hadoop uses hostnames to
access other nodes.
To fix this, try opening your jobtracker log file (it often resides in
$HADOOP_HOME/logs/hadoop--jobtracker-.log ) t
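For what it's worth, a hedged sketch (Python, with placeholder hostnames) of
how one might sanity-check forward and reverse name resolution from each box:
    import socket
    nodes = ["master", "slave1", "slave2"]   # placeholders: use your real node names
    for name in nodes:
        try:
            ip = socket.gethostbyname(name)       # forward lookup
            back = socket.gethostbyaddr(ip)[0]    # reverse lookup
            print("%s -> %s -> %s" % (name, ip, back))
        except socket.error as e:
            print("lookup failed for %s: %s" % (name, e))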
Yes. I would have to look up the exact numbers in my Sent folder (I've
been reporting all the time to my boss), but it has been a sizeable
portion of the runtime (on an EC2+HDFS-on-S3 cluster), especially when
one included the time to load the data into HDFS.
Ah, went ahead and looked it up:
Because many, many people do not enjoy that verbose language, you know.
(Just replaced an old 754-line-long task with a ported one that takes 89
lines.)
So as crazy as it might sound to some here, hadoop streaming is the primary
interface for probably a sizeable part of the "user population". (Users
be
>
> A set of reasonable performance tests and results would be very helpful
> in helping people decide whether to go with streaming or not. Hopefully
> we can get some numbers from this thread and publish them? Anyone else
> compared streaming with native java?
>
I think this is a great idea. I t
I don't have any particular experience with this, but perhaps X-Trace
[1] can help. The presentation given at the Hadoop Summit was very
impressive, looks like a great debugging tool. There are hooks already
in Hadoop, so I think it's just a matter of enabling them, collecting
the data, and generat
Really?
I would expect the opposite: for compressed files to process slower.
You're saying that is not the case, and that compressed files actually
increase the speed of jobs?
-Colin
On Mon, Mar 31, 2008 at 4:51 PM, Andreas Kostyrka <[EMAIL PROTECTED]>
wrote:
> Well, on our EC2/HDFS-on-S3 clus
Travis Brady wrote:
This brings up two interesting issues:
1. Hadoop streaming is a potentially very powerful tool, especially for
those of us who don't work in Java for whatever reason
2. If Hadoop streaming is "at best a jury rigged solution" then that should
be made known somewhere on the wik
We have 40 or so engineers who only use streaming at Facebook, so no
matter how jury-rigged the solution might be, it's been immensely
valuable for developer productivity. As we found with Thrift, letting
developers write code in their language of choice has many benefits,
including development sp
It is provided because of point (2).
But that doesn't necessarily make it a good thing to do. The basic idea has
real problems.
Hive is likely to resolve many of these issues (when it becomes publicly
available), but some are inherent in the basic idea of moving data across
language barriers t
This brings up two interesting issues:
1. Hadoop streaming is a potentially very powerful tool, especially for
those of us who don't work in Java for whatever reason
2. If Hadoop streaming is "at best a jury rigged solution" then that should
be made known somewhere on the wiki. If it's really not
This seems a bit surprising. In my experience well-written Java is
generally just about as fast as C++, especially for I/O bound work. The
exceptions are:
- java startup is still slow. This shouldn't matter much here because you
are using streaming anyway so you have java startup + C startup.
Well, we would like to use hadoop streaming because our current system is in
C++ and it is easier to migrate to hadoop streaming. Also we have very
strict performance requirements. Java seems to be too slow. I rewrote the
first program in Java and it runs 4 to 5 times slower than the C++ one.
On M
Hadoop can't split a gzipped file so you will only get as many maps as you
have files.
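One hedged workaround sketch (Python; the file names are placeholders): pre-split
a large input into several chunks and gzip each chunk, so that every compressed
file gets its own map task:
    import gzip
    LINES_PER_CHUNK = 1000000             # tune so each chunk is a reasonable map input
    chunk_no, out = 0, None
    with open("input.txt", "rb") as src:  # "input.txt" is a placeholder
        for i, line in enumerate(src):
            if i % LINES_PER_CHUNK == 0:
                if out is not None:
                    out.close()
                out = gzip.open("chunk-%05d.gz" % chunk_no, "wb")
                chunk_no += 1
            out.write(line)
    if out is not None:
        out.close()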
Why the obsession with hadoop streaming? It is at best a jury rigged
solution.
On 3/31/08 3:12 PM, "lin" <[EMAIL PROTECTED]> wrote:
> Does Hadoop automatically decompress the gzipped file? I only have a si
My previous java.opts was actually "--server -Xmx512m". I increased the heap
size to 1024M and the running time was about the same. The resident sizes of
the java processes seem to be no greater than 150M.
On Mon, Mar 31, 2008 at 1:56 PM, Theodore Van Rooy <[EMAIL PROTECTED]>
wrote:
> try extendi
Does Hadoop automatically decompress the gzipped file? I only have a single
input file. Does it have to be split and then gzipped?
I gzipped the input file and Hadoop only created one map task. Still java is
using more than 90% CPU.
On Mon, Mar 31, 2008 at 1:51 PM, Andreas Kostyrka <[EMAIL PRO
Hi,
I am using the default settings from hadoop-default.xml and hadoop-site.xml,
and I just changed this port number:
mapred.task.tracker.report.address
I created a firewall rule to allow the port range 5:50100 between the slaves
and the master.
But reduce on the slaves is using some other ports
try extending the Java heap size as well. I'd be interested to see what
kind of an effect that has on time (if any).
On Mon, Mar 31, 2008 at 2:30 PM, lin <[EMAIL PROTECTED]> wrote:
> I'm running custom map programs written in C++. What the programs do is
> very
> simple. For example, in progra
Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
provide the input files gzipped. Not a great difference (e.g. 50% slower
when not gzipped, plus it took more than twice as long to upload the
data to HDFS-on-S3 in the first place), but still probably relevant.
Andreas
On Monday, the
I'm running custom map programs written in C++. What the programs do is very
simple. For example, in program 2, for each input line
ID node1 node2 ... nodeN
the program outputs
node1 ID
node2 ID
...
nodeN ID
Each node has 4GB to 8GB of memory. The java memory
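A minimal Python sketch of the kind of inverting mapper described above (the
original programs were C++); it reads lines of the form "ID node1 node2 ... nodeN"
on stdin and emits one "node ID" pair per node, tab-separated so streaming treats
the node as the key:
    import sys
    for line in sys.stdin:
        fields = line.split()
        if len(fields) < 2:
            continue                      # skip empty or malformed lines
        record_id = fields[0]
        for node in fields[1:]:
            print("%s\t%s" % (node, record_id))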
Hi,
Can I pass a directory having subdirectories (which in turn have
subdirectories) to hadoop as an input path?
I tried it, but I got an error :(
-Taran
Are you running a custom map script or a standard linux command like wc? If
custom, what does your script do?
How much RAM do you have? What are your Java memory settings?
I used the following setup:
2 dual cores, 16 GB RAM, 1000 MB Java heap size on an empty box with a 4 task
max.
I got the follo
Hi,
I am looking into using Hadoop streaming to parallelize some simple
programs. So far the performance has been pretty disappointing.
The cluster contains 5 nodes. Each node has two CPU cores. The task capacity
of each node is 2. The Hadoop version is 0.15.
Program 1 runs for 3.5 minutes on th
I've set up a job to run on my small 4 (sometimes 5) node cluster on dual
processor server boxes with 2-8GB of memory.
My job processes 24 100-300MB files that are a days worth of logs, total
data is about 6GB.
I've modified the word count example to do what I need, and it works fine on
small tes
Iván,
Whether this was expected or an error depends on what happened on the
client. This could happen and would not be a bug if the client was killed
for some other reason, for example. But if the client is also similarly
surprised, then it's a different case.
You could grep for this block in the NameNode log
Sweet, I did not see that in Jira / did not relate the errors. Now time to wait
until 0.16.2.
On Mon, Mar 31, 2008 at 7:33 AM, Tom White <[EMAIL PROTECTED]> wrote:
> Hi Jake,
>
> Yes, this is a known issue that is fixed in 0.16.2 - see
> https://issues.apache.org/jira/browse/HADOOP-3027.
>
> Tom
>
>
Hi all,
Currently I'm able to run map-reduce jobs from the box where the NameNode and
JobTracker are running. But I'd like to run my jobs from a separate box,
from which I have access to HDFS. I have updated the params fs.default.name
and mapred.job.tracker in the local hadoop dir to point to the cluster's
mas
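For reference, a hedged sketch of the kind of client-side hadoop-site.xml
overrides being described here (the hostnames and ports below are placeholders,
not values taken from this thread):
    <property>
      <name>fs.default.name</name>
      <value>hdfs://namenode-host:9000</value>
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <value>jobtracker-host:9001</value>
    </property>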
Hi Jake,
Yes, this is a known issue that is fixed in 0.16.2 - see
https://issues.apache.org/jira/browse/HADOOP-3027.
Tom
On 31/03/2008, Jake Thompson <[EMAIL PROTECTED]> wrote:
> So I am new to hadoop, but everything is working well so far.
> Except.
> I can use S3 fs in 15.3 without a pro
Thanks,
I have tried with the trunk version, and now the exception "Trying to
change block file offset of block blk_... to ... but actual size of file
is ..." has disappeared and the jobs don't seem to get blocked.
But I get other "Broken Pipe" and "EOF" exceptions in the dfs logs.
They seem
Hi ,
You have to set "garl-intel2" as the namenode in the fs.default.name
parameter in hadoop-site.xml.
Just follow this URL:
http://hadoop.apache.org/core/docs/current/quickstart.html
for installing Hadoop and configuring the Hadoop parameters.
---
Peeyush
On Mon, 2008-03-31 at 14
Hi Peeyush,
Thanks a lot for your reply.
Can you tell me how to go about adding garl-intel2 to hadoop-default.xml?
Is there any document available for doing that?
Thanking you.
On Mon, Mar 31, 2008 at 2:26 PM, Peeyush Bishnoi <[EMAIL PROTECTED]>
wrote:
> Hello Raghvendra ,
>
> It is working
Hello Raghvendra,
It is working with "default" because the "default" parameter uses the
file system configured in hadoop-site/hadoop-default.xml. As your
localhost and garl-intel2 are not configured in
hadoop-site/hadoop-default.xml, it is throwing the error.
So with hdfsConnect you have to use H
Hi,
When I connect using the code
hdfsFS fs = hdfsConnect("default", 0);
The program works successfully and I am able to save the data as well. But when I
connect using the code
hdfsFS fs = hdfsConnect("localhost", 0);
(or)
hdfsFS fs = hdfsConnect("garl-intel2", 9000);
I get the following error.
08/03/