Great experience!
/Edward
On Fri, Sep 19, 2008 at 2:50 PM, Palleti, Pallavi
[EMAIL PROTECTED] wrote:
Yeah, that was the problem. And Hama can surely be useful for large-scale
matrix operations.
But for this problem, I have modified the code to just pass the ID
information and read the
It's probably not corrupted. If by compressed lzo file you mean
something readable with lzop, you should use LzopCodec, not LzoCodec.
LzoCodec doesn't write header information required by that tool.
Guessing at the output format (length-encoded blocks of data
compressed by the lzo
Hi Chris,
I was also unable to decompress it by simply running a map/reduce job with cat
as the mapper and then doing dfs -get, either.
I will try using LzopCodec.
Thanks,
- Alex
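For anyone hitting the same thing, a minimal jobconf sketch of selecting the lzop-compatible codec for job output -- the property names are the 0.18-era ones, and the codec's package may differ depending on your Hadoop build, so treat the class path as an assumption to verify:

```xml
<!-- Enable job output compression with the lzop-compatible codec.
     The codec's package is assumed; verify it in your Hadoop build. -->
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.LzopCodec</value>
</property>
```

Output written this way should then be readable with the standalone lzop tool, since LzopCodec writes the header that LzoCodec omits.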
On Fri, Sep 19, 2008 at 2:34 AM, Chris Douglas [EMAIL PROTECTED] wrote:
It's probably not corrupted. If by compressed lzo
Hi Chris,
Have a look at Cassandra (from Facebook). [
http://code.google.com/p/the-cassandra-project/]
It's a BigTable implementation based on Amazon Dynamo (it's completely
decentralized/P2P with no single point of failure). You can import data
into it very quickly (it's got asynch and synchronous
Hi all,
The short version of my question is in the subject. Here's the long version:
I have two map/reduce jobs that output records using a common key:
Job A:
K1 = A1,1
K1 = A1,2
K2 = A2,1
K2 = A2,2
Job B:
K1 = B1
K2 = B2
K3 = B3
And a third job that merges records with the same
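A sketch of what Job C's reducer effectively does once the shuffle has grouped both jobs' outputs by key -- the grouping itself is the framework's job, and plain Python stands in for it here; the sorting of merged values is an illustrative assumption:

```python
# Toy model of the Job C merge: the shuffle groups every (key, value)
# pair from Jobs A and B by key, and the reducer sees all values for
# one key together. Plain Python stands in for the MapReduce framework.
from collections import defaultdict

job_a = [("K1", "A1,1"), ("K1", "A1,2"), ("K2", "A2,1"), ("K2", "A2,2")]
job_b = [("K1", "B1"), ("K2", "B2"), ("K3", "B3")]

grouped = defaultdict(list)
for key, value in job_a + job_b:   # the shuffle: group by key
    grouped[key].append(value)

# Each reducer call merges one key's records from both jobs.
merged = {key: sorted(values) for key, values in grouped.items()}
# merged["K1"] == ["A1,1", "A1,2", "B1"]
```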
On Thu, Sep 18, 2008 at 1:05 AM, Chris Dyer [EMAIL PROTECTED] wrote:
Basically, I'd like to be able to
load the entire contents of a key-value map file in DFS into
memory across many machines in my cluster so that I can access any of
it with ultra-low latencies.
I think the simplest way,
So here's my question -- does Hadoop guarantee that all records with
the same key will end up in the same Reducer task? If that's true,
Yes -- think of each record as being sent to a task by hashing over the key.
Miles
2008/9/19 Stuart Sierra [EMAIL PROTECTED]:
Hi all,
The short version of
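To make the "hashing over the key" answer concrete, here is a small sketch of how the default partitioning behaves; Python's hash() stands in for Java's hashCode(), and the mask-and-modulo shape follows Hadoop's HashPartitioner, but treat the details as illustrative:

```python
# Sketch of hash partitioning: every record with a given key goes to
# the same reduce task, because the task index is a pure function of
# the key. hash() here stands in for Java's hashCode().

def partition(key, num_reducers):
    return (hash(key) & 0x7FFFFFFF) % num_reducers

records = [("K1", "A1,1"), ("K2", "A2,1"), ("K1", "B1"), ("K2", "B2")]
buckets = {}
for key, value in records:
    buckets.setdefault(partition(key, 4), []).append((key, value))

# Every record sharing a key lands in the same bucket (reducer).
```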
The problem here is that you don't want each mapper/reducer to have a
copy of the data. You want that data -- which can be very large --
stored in a distributed manner over your cluster, with random
access to it during computation.
(This is what HBase etc. do.)
Miles
2008/9/19 Stuart Sierra
Miles Osborne wrote:
The problem here is that you don't want each mapper/reducer to have a
copy of the data. You want that data -- which can be very large --
stored in a distributed manner over your cluster, with random
access to it during computation.
(This is what HBase etc. do.)
I had a
On Wed, Sep 17, 2008 at 10:05 PM, Chris Dyer [EMAIL PROTECTED] wrote:
I'm looking for a lightweight way to serve data stored as key-value
pairs in a series of MapFiles or SequenceFiles.
Might be worth taking a look at CouchDB as well. Haven't used it
myself, so can't comment on how it might
Only about 5 pipes/C++-related posts since mid-July, and basically no
responses. Is anyone really using or actively developing Pipes? We've
invested some time to make it platform-independent (ported BSD sockets
to Boost sockets, and the XDR serialization to Boost serialization), but
it's still
I've had the same problem, when wanting to integrate pipes into my
system. I haven't seen serious support/comment on pipes, so I'm
seeing if I can steer clear of this. Maybe this is a wakeup call to
see if we've both missed something.
David
On Sep 19, 2008, at 12:10 PM, Marc Vaillant
Do any of CouchDB/Cassandra/other frameworks specifically do in-memory serving?
I haven't found any that do this explicitly. For now I've been using
memcached for that
functionality (with the usual memcached caveats).
Ehcache may be another memcache-like solution
Memcached looks like it would be a reasonable solution for my problem,
although it's not optimal since it doesn't support an easy way of
initializing itself at startup, but I can work around that. This may
be wishful thinking, but does anyone have any experience using the
Hadoop job/task
If that's true, then can I set the number of Reducers very high
(even equal to the number of maps) to make Job C go faster?
This page has some good info on finding the right number of reducers:
http://wiki.apache.org/hadoop/HowManyMapsAndReduces
/ Per
On Fri, Sep 19, 2008 at 9:42 AM, Miles
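For reference, the knob under discussion as a jobconf fragment -- the property name is the 0.18-era one, and the value is only an example; size it per the wiki page's actual heuristics rather than simply matching the map count:

```xml
<!-- Number of reduce tasks for the job (example value; size it per
     the wiki's guidance, e.g. ~0.95 * nodes * reduce slots per node). -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>28</value>
</property>
```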
If each mapper only sees a relatively small chunk of the data, then
why not have each one count the 2-perms in memory?
You would then get the reducer to merge these partial results together.
(Details are left to the reader ...)
Miles
2008/9/19 Sandy [EMAIL PROTECTED]:
Hi,
I
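Miles's suggestion can be sketched in a few lines -- the tokenization and the definition of a 2-perm here are assumptions for illustration, not taken from the actual problem:

```python
# Each mapper counts ordered pairs (2-perms) of its own chunk in
# memory; the reducer merges the partial Counters by summation.
from collections import Counter
from itertools import permutations

def map_count_pairs(tokens):
    # one mapper's in-memory partial result for its chunk
    return Counter(permutations(tokens, 2))

def reduce_merge(partials):
    total = Counter()
    for partial in partials:
        total.update(partial)   # merge by summing counts
    return total

chunk1 = map_count_pairs(["a", "b", "c"])
chunk2 = map_count_pairs(["a", "b"])
totals = reduce_merge([chunk1, chunk2])
# totals[("a", "b")] == 2  -- the pair occurs once in each chunk
```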
Miles,
Thanks for your response.
I think I understand... basically, I'm adding a combiner class that computes
the partial results in phase 2, correct (just like in the word count
example)?
However, even if I do that, I don't think it gets rid of the overhead of
reading 48 GB from disk back into
Thank you for the link, Edward. I'll take a look at HAMA.
Does anyone know if there is a way to limit the upper bound of maps being
produced? I see now that mapred.tasktracker.tasks.maximum really does not
limit the number of maps, as the number of maps is determined by
InputFormat. Aside from
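On the map-count question specifically: with the FileInputFormat family, the number of maps equals the number of input splits, so the usual lever is the minimum split size rather than any tasktracker maximum. A hedged jobconf sketch (0.18-era property name; verify for your version):

```xml
<!-- Raising the minimum split size yields fewer, larger splits,
     and therefore fewer map tasks. Value is an example (256 MB). -->
<property>
  <name>mapred.min.split.size</name>
  <value>268435456</value>
</property>
```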