Currently, Hadoop does round-robin allocation of blocks and data
across multiple JBOD disks. We did some testing and found that there
weren't significant differences between RAID-0 and JBOD. We went with
JBOD because we figured that RAID-0 has a higher failure rate than
JBOD -- any disk failure takes out the entire RAID-0 stripe set,
whereas with JBOD only the data on the failed disk is lost.
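Some back-of-the-envelope arithmetic makes the risk difference concrete (the per-disk failure probability here is assumed for illustration, not a figure from the thread):

```python
# Assumed numbers: p is a per-disk failure probability over some window,
# n is the number of drives per node (5, as in the cluster discussed below).
p = 0.05
n = 5

# RAID-0 stripes every block across all disks, so one failure loses the set.
raid0_loss_prob = 1 - (1 - p) ** n

# JBOD loses only the failed disk's share; the expected fraction lost is p.
jbod_expected_loss = p

print(f"RAID-0 node-loss probability: {raid0_loss_prob:.3f}")
print(f"JBOD expected fraction of node data lost: {jbod_expected_loss:.3f}")
```

With these numbers the RAID-0 node loses everything with probability ~0.226, while JBOD expects to lose only ~5% of its data.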
of Computer Science & Engineering, Korea University
1, 5-ga, Anam-dong, Seongbuk-gu, Seoul, 136-713, Republic of Korea
TEL : +82-2-3290-3580
-
On Tue, Oct 21, 2008 at 10:23 AM, Colin Evans <[EMAIL PROTECTED]> wrote:
Hi Edward,
At Metaweb, we're experimenting with storing raw triples in HDFS flat
files, and have written a simple query language and planner that
executes the queries with chained map-reduce jobs. This approach works
well for warehousing triple data, and doesn't require HBase. Queries
may ta
The trick is to amortize your computation over the whole set. So DFS
for a single node will always be faster on an in-memory graph, but
Hadoop is a good tool for computing all-pairs shortest paths in one shot
if you re-frame the algorithm as a belief propagation and message
passing algorithm.
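As a toy illustration of that re-framing (my own sketch, not Freebase's code), each map-reduce round can propagate tentative distances as messages and keep the per-node minimum, iterating until nothing changes:

```python
from collections import defaultdict

INF = float("inf")

def shortest_paths(edges, source):
    """Single-source shortest paths; each while-iteration is one MR round.

    edges: {node: [(neighbour, weight), ...]}
    """
    dist = {u: (0 if u == source else INF) for u in edges}
    changed = True
    while changed:
        # Map phase: every node emits its tentative distance to neighbours.
        msgs = defaultdict(list)
        for u, d in dist.items():
            msgs[u].append(d)              # keep own current value
            if d < INF:
                for v, w in edges[u]:
                    msgs[v].append(d + w)
        # Reduce phase: each node keeps the minimum distance it received.
        changed = False
        for u, ds in msgs.items():
            best = min(ds)
            if best < dist.get(u, INF):
                dist[u] = best
                changed = True
    return dist

graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 1)], "c": []}
print(shortest_paths(graph, "a"))   # {'a': 0, 'b': 1, 'c': 2}
```

All-pairs is the same computation run with every node as a source, which is where amortizing over the whole set pays off.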
At Freebase, we're mapping our large graphs into very large files of
triples in HDFS and running large queries over them.
Hadoop is optimized for processing streaming data off of disk, and we've
found that trying to load a multi-GB graph and then access it in a
Hadoop task has scaling problems
Unfortunately, setting those environment variables did not help my
issue. It appears that the "HADOOP_LZO_LIBRARY" variable is not
defined in both LzoCompressor.c and LzoDecompressor.c. Where is this
variable supposed to be set?
On Sep 30, 2008, at 12:33 PM, Colin Evans wrote:
Hi N
adoop/io/compress/lzo/LzoCompressor.c:135:
error: syntax error before ',' token
[exec] make[2]: *** [LzoCompressor.lo] Error 1
[exec] make[1]: *** [all-recursive] Error 1
[exec] make: *** [all] Error 2
Any ideas?
On Sep 30, 2008, at 11:53 AM, Colin Evans wrote:
There's a patch to get the native targets to build on Mac OS X:
http://issues.apache.org/jira/browse/HADOOP-3659
You probably will need to monkey with LDFLAGS as well to get it to work,
but we've been able to build the native libs for the Mac without too
much trouble.
Doug Cutting wrote:
A
Freebase is finally open-sourcing our Jython-based framework for writing
map-reduce jobs on Hadoop. Happy tightly embeds Jython into the Hadoop
APIs, files off a lot of the sharp edges, and makes writing map-reduce
programs a breeze. This is the 0.1 release, but we've been using Happy
at Free
ning financial
data from the SEC in Freebase, a talk by Kurt Bollacker on data mining
Wikipedia, and a talk by Kirrily Robert on new features in Freebase.
Sign up if you're planning on coming - space is limited.
http://upcoming.yahoo.com/event/760574
Thanks
Colin Evans
We're building a cluster of 40 machines with 5 drives each, and I'm
curious what people's experiences have been for using RAID-0 for HDFS
vs. configuring separate partitions (JBOD) and having the datanode
balance between them.
I took a look at the datanode code, and datanodes appear to write b
At Metaweb, we did a lot of comparisons between streaming (using Python)
and native Java, and in general streaming performance was not much
slower than the native java -- most of the slowdown was from Python
being a slow language.
The main problems with streaming apps that we found are that th
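For context, a streaming job is just a mapper and a reducer speaking tab-separated key/value lines over stdin/stdout; a minimal word-count pair in Python (a generic sketch, not Metaweb's actual jobs) looks like:

```python
from itertools import groupby

def mapper(lines):
    # Emit one "word<TAB>1" line per word, as Hadoop Streaming expects.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    # Streaming delivers the reducer's input grouped and sorted by key.
    for key, group in groupby(sorted_lines, key=lambda l: l.split("\t")[0]):
        total = sum(int(l.split("\t")[1]) for l in group)
        yield f"{key}\t{total}"

if __name__ == "__main__":
    # Local simulation of the shuffle: map, sort, reduce.
    mapped = sorted(mapper(["b a", "a c a"]))
    print(list(reducer(mapped)))   # ['a\t3', 'b\t1', 'c\t1']
```

Since the framework still does the shuffle, sort, and I/O in Java, the Python layer mostly adds per-record interpretation overhead, which matches the observation above.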
Here's the code. If folks are interested, I can submit it as a patch as
well.
Prasan Ary wrote:
Colin,
Is it possible that you share some of the code with us?
thx,
Prasan
Colin Evans <[EMAIL PROTECTED]> wrote:
We ended up subclassing TextInputFormat and adding a custom RecordReader
that starts and ends record reads on tags. The
StreamXmlRecordReader class is a good reference for this.
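The splitting logic itself is simple; here is a pure-Python sketch of the same idea (names and tags are illustrative -- the real version subclasses TextInputFormat and works on HDFS splits):

```python
def read_records(text, start_tag, end_tag):
    """Yield each span from start_tag through end_tag as one record."""
    pos = 0
    while True:
        begin = text.find(start_tag, pos)
        if begin == -1:
            return
        end = text.find(end_tag, begin)
        if end == -1:
            return                      # ignore a trailing partial record
        end += len(end_tag)
        yield text[begin:end]
        pos = end

doc = "<doc><rec>a</rec>junk<rec>b</rec></doc>"
print(list(read_records(doc, "<rec>", "</rec>")))
# ['<rec>a</rec>', '<rec>b</rec>']
```

The extra complexity in the real RecordReader is handling records that straddle split boundaries, which is what StreamXmlRecordReader demonstrates.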
Prasan Ary wrote:
Hi All,
I am writing a java implementation for my map/reduce function on hadoop.
Input to th
On 2/12/08 12:19 PM, "Colin Evans" <[EMAIL PROTECTED]> wrote:
The big question for me is how well a dual-CPU 4-core (8 cores per box)
configuration will do. Has anyone tried out this configuration with
Intel or AMD CPUs? Is the memory throughput sufficient?
Because we acquired servers of different capacities at different times,
we have 2 servers with 1TB of disk each, and 11 servers with ~300GB
each. The 1TB servers tend to be under-utilized by HDFS given their
capacity. This makes sense, as block replicas need to be relatively
evenly distributed.
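A quick illustration with the numbers from the post (the 50% overall fill level is assumed):

```python
# 2 nodes at 1 TB and 11 nodes at ~300 GB; an even per-node block spread
# puts about the same amount of data on every datanode.
capacities_gb = [1000, 1000] + [300] * 11
cluster_used_gb = 0.5 * sum(capacities_gb)        # suppose 50% full overall
per_node_gb = cluster_used_gb / len(capacities_gb)

for cap in (1000, 300):
    print(f"{cap} GB node: {per_node_gb / cap:.0%} utilized")
```

With an even spread the 300 GB nodes run roughly three times fuller than the 1 TB nodes, which is exactly the under-utilization described above.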
It looks like a bunch of blocks
got allocated on the datanodes that the namenode doesn't know about,
and the datanodes are refusing to work with new blocks that have the
same id. Does this sound likely? What's a good fix for this?
Thanks!
Colin Evans
Hi Ted,
I've been building out a similar framework in JavaScript (Rhino) for
work that I've been doing at MetaWeb, and we've been thinking about open
sourcing it too. It's pretty clear that there are major benefits to
using a dynamic scripting language with Hadoop.
I'd love to see how you'r