Ian Soboroff wrote:
I created a JIRA (https://issues.apache.org/jira/browse/HADOOP-5615)
with a spec file for building a 0.19.1 RPM.
I like the idea of Cloudera's RPM file very much. In particular, it has
nifty /etc/init.d scripts and RPM is nice for managing updates.
However, it's for an
Snehal Nagmote wrote:
Can you please explain what exactly adding an NIO bridge means, how it can be
done, and what the advantages would be in this case?
NIO: Java non-blocking I/O. It's a standard API to talk to different
filesystems; support has been discussed in JIRA. If the DFS APIs were
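(For context, below is a minimal sketch of the existing Hadoop FileSystem
abstraction that such an NIO bridge would presumably wrap; the class name and
usage are invented for illustration, not proposed bridge code.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical example: read a file through the FileSystem API, which
    // dispatches to HDFS, the local FS, S3, etc. based on the URI scheme.
    public class CatFromDfs {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads hadoop-site.xml
        FileSystem fs = FileSystem.get(conf);       // fs.default.name picks the backend
        FSDataInputStream in = fs.open(new Path(args[0]));
        byte[] buf = new byte[4096];
        for (int n; (n = in.read(buf)) > 0; ) {
          System.out.write(buf, 0, n);
        }
        in.close();
      }
    }

Because FileSystem.get() resolves the backend from configuration, the same
client code already works against several filesystems; a bridge would expose
that polymorphism through the standard Java API instead of a Hadoop-specific one.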
Brian Bockelman wrote:
On Apr 2, 2009, at 3:13 AM, zhang jianfeng wrote:
It seems like I would have to pay additional money, so why not configure a
Hadoop cluster on EC2 myself? This has already been automated using scripts.
Not everyone has a support team or an operations team or enough
On Fri, 2009-04-03 at 11:19 +0100, Steve Loughran wrote:
True, but this way nobody gets the opportunity to learn how to do it
themselves, which can be a tactical error one comes to regret further
down the line. By learning the pain of cluster management today, you get
to keep it under
Hello all,
Following recent hardware discussions, I thought I'd ask a related
question. Our cluster nodes have 3 drives: 1x 160GB system/scratch and
2x 500GB DFS drives.
The 160GB system drive is partitioned such that 100GB is for mapred.local
job space. However, we find that for our
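(For a layout like that, the relevant knobs are mapred.local.dir and
dfs.data.dir; a hypothetical hadoop-site.xml fragment, with the mount points
invented for illustration:)

    <configuration>
      <!-- job scratch space on the 100GB partition of the system drive -->
      <property>
        <name>mapred.local.dir</name>
        <value>/scratch/mapred/local</value>
      </property>
      <!-- DFS blocks spread across the two 500GB drives -->
      <property>
        <name>dfs.data.dir</name>
        <value>/data1/dfs/data,/data2/dfs/data</value>
      </property>
    </configuration>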
If you guys want to spin RPMs for the community, that's great. My main
motivation was that I wanted the current version rather than 0.18.3.
There is of course (as Steve points out) a larger discussion about whether
you want RPMs and, if so, what should be in them. In particular, some might
want to include the
Steve Loughran ste...@apache.org writes:
I think from your perspective it makes sense, as it stops anyone getting
itchy fingers and doing their own RPMs.
Um, what's wrong with that?
Ian
Steve Loughran ste...@apache.org writes:
-RPM and deb packaging would be nice
Indeed. The best thing would be to have the Hadoop build system output
them, for some sensible subset of systems.
-the JDK requirements are too harsh, as it should run on OpenJDK's JRE
or JRockit; no need for Sun
On Thu, Apr 2, 2009 at 4:13 AM, zhang jianfeng zjf...@gmail.com wrote:
It seems like I would have to pay additional money, so why not configure a Hadoop
cluster on EC2 myself? This has already been automated using scripts.
Personally, I'm excited about this. They're charging a tiny fraction
above
I may be wrong, but I would welcome this. As far as I understand, the hot
topic in cloud computing these days is standardization ... and I would be
happy if Hadoop could be considered a standard for cloud computing
architecture. So the more Amazon pushes Hadoop, the more it could be accepted
by
I disagree. This is like arguing that everyone should learn everything,
otherwise they won't know how to do everything.
A better situation is having the algorithm designer focus on how to break
down their algorithm into Map/Reduce form and test it out immediately,
rather than requiring
Thanks for the comments, Matei.
The machines I ran the experiments on have 16 GB of memory each. I don't see
how a 64 MB buffer could be huge or bad for memory consumption. In
fact, I set it to a much larger value after initial rounds of tests showed
abysmal results using the default 64 KB buffer. Also,
Has anyone benchmarked the performance difference of using Hadoop?
1) Java vs C++
2) Java vs Streaming
From looking at the Hadoop architecture, since the TaskTracker will fork a
separate process anyway to run the user-supplied map() and reduce() functions,
I don't see the performance overhead of
On Apr 3, 2009, at 9:42 AM, Ricky Ho wrote:
Has anyone benchmarked the performance difference of using Hadoop?
1) Java vs C++
2) Java vs Streaming
Yes, a while ago. When I tested it using sort, Java and C++ were
roughly equal and streaming was 10-20% slower. Most of the cost with
Hadoop does checksums for each small chunk of the file (512 bytes by
default) and stores a list of checksums for each chunk with the file, rather
than storing just one checksum at the end. This makes it easier to read only
a part of a file and know that it's not corrupt without having to read and
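(The idea is easy to sketch. The toy Java below is not Hadoop's actual
checksum code, just an illustration of keeping one CRC32 per 512-byte chunk,
512 being the io.bytes.per.checksum default, so any chunk can be verified
without reading the rest of the file.)

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.zip.CRC32;

    // Toy illustration: one checksum per 512-byte chunk instead of a
    // single checksum over the whole file.
    public class ChunkChecksums {
      static final int BYTES_PER_CHECKSUM = 512;

      static List<Long> checksums(String path) throws IOException {
        List<Long> sums = new ArrayList<Long>();
        FileInputStream in = new FileInputStream(path);
        byte[] chunk = new byte[BYTES_PER_CHECKSUM];
        for (int n; (n = in.read(chunk)) > 0; ) {
          CRC32 crc = new CRC32();
          crc.update(chunk, 0, n);
          sums.add(crc.getValue());   // chunk i is verifiable on its own
        }
        in.close();
        return sums;
      }
    }

To validate a partial read, you only recompute the checksums of the chunks
that overlap the requested byte range.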
Owen, thanks for your elaboration; the data point is very useful.
On your point ...
In java you get
key1, (value1, value2, ...)
key2, (value3, ...)
in streaming you get
key1 value1
key1 value2
On Apr 3, 2009, at 10:35 AM, Ricky Ho wrote:
I assume that the keys are still sorted, right? That means I will get
all the (key1, valueX) entries before getting any of the (key2, valueY)
entries, and key2 is always bigger than key1.
Yes.
-- Owen
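(To make the contrast concrete, here is a minimal reducer sketch against the
0.19-era org.apache.hadoop.mapred API, where the framework hands over each key
with all of its values already grouped. A streaming reducer instead reads
sorted key/value lines from stdin and must detect the key change itself.)

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Minimal sketch: reduce() is called once per key, with all of that
    // key's values in one iterator, i.e. the "key1, (value1, value2, ...)"
    // shape described above.
    public class SumReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }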
Thanks for the update and suggestion, Matei.
I can certainly construct an input text file containing all the files of
a dataset
(http://hadoop.apache.org/core/docs/r0.19.1/streaming.html#How+do+I+process+files%2C+one+per+map%3F),
then let the jobtracker dispatch the file names to the maps, and
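(The same one-file-per-map pattern can be sketched in the Java API: each input
record is a path, and the map task opens that file itself. The class name and
the byte-counting body below are invented for illustration; something like
NLineInputFormat would feed one line to each map.)

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical mapper: the input value is an HDFS path; this map task
    // processes that one file (here it just counts its bytes).
    public class FilePerMapMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {
      private JobConf conf;

      public void configure(JobConf job) { this.conf = job; }

      public void map(LongWritable offset, Text pathLine,
                      OutputCollector<Text, LongWritable> out,
                      Reporter reporter) throws IOException {
        Path file = new Path(pathLine.toString().trim());
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = fs.open(file);
        byte[] buf = new byte[64 * 1024];
        long bytes = 0;
        for (int n; (n = in.read(buf)) > 0; ) {
          bytes += n;
          reporter.progress();   // keep the task alive on large files
        }
        in.close();
        out.collect(pathLine, new LongWritable(bytes));
      }
    }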
On Apr 3, 2009, at 1:41 PM, Tu, Tiankai wrote:
By the way, what is the largest size---in terms of total bytes, number
of files, and number of nodes---in your applications? Thanks.
The largest Hadoop application that has been documented is the Yahoo
Webmap.
10,000 cores
500 TB shuffle
300
I guess the performance will be bad, but we should still be able to read/write
the file. Correct?
Why do we throw an Exception?
Zheng
On Fri, 2009-04-03 at 09:42 -0700, Ricky Ho wrote:
1) I can pick the language that offers a different programming
paradigm (e.g. I may choose a functional language, or logic programming,
if they suit the problem better). In fact, I can even choose Erlang
at the map() and Prolog at the
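(Streaming makes such a mix mechanical, since each phase is just an executable
reading stdin and writing stdout. A hypothetical invocation; the script names
and interpreter commands are placeholders:)

    hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar \
        -input  /data/input \
        -output /data/output \
        -mapper  "escript mapper.erl" \
        -reducer "swipl -q -s reducer.pl" \
        -file mapper.erl \
        -file reducer.pl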