Re: OutOfMemory Error

2008-09-19 Thread Edward J. Yoon
Great experience! /Edward On Fri, Sep 19, 2008 at 2:50 PM, Palleti, Pallavi [EMAIL PROTECTED] wrote: Yeah. That was the problem. And Hama can surely be useful for large-scale matrix operations. But for this problem, I have modified the code to just pass the ID information and read the

Re: Data corruption when using Lzo Codec

2008-09-19 Thread Chris Douglas
It's probably not corrupted. If by compressed lzo file you mean something readable with lzop, you should use LzopCodec, not LzoCodec; LzoCodec doesn't write the header information required by that tool. Guessing at the output format (length-encoded blocks of data compressed by the lzo
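[Editor's note: a minimal sketch, not from the thread, of pointing a job's output compression at LzopCodec instead of LzoCodec so the output is readable with the lzop tool. The package and class names assume the codecs bundled with Hadoop releases of that era; adjust to whatever your build ships.]

    // Switch job output compression to LzopCodec (writes the lzop header)
    // rather than LzoCodec (raw compressed blocks only).
    import org.apache.hadoop.io.compress.LzopCodec;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class LzopOutputExample {
      public static void configure(JobConf conf) {
        FileOutputFormat.setCompressOutput(conf, true);
        // LzopCodec adds the header/checksums that the lzop CLI expects.
        FileOutputFormat.setOutputCompressorClass(conf, LzopCodec.class);
      }
    }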

Re: Data corruption when using Lzo Codec

2008-09-19 Thread Alex Feinberg
Hi Chris, I was also unable to decompress by simply running a map/reduce job with cat as the mapper and then doing dfs -get, either. I will try using LzopCodec. Thanks, - Alex On Fri, Sep 19, 2008 at 2:34 AM, Chris Douglas [EMAIL PROTECTED] wrote: It's probably not corrupted. If by compressed lzo

Re: Serving contents of large MapFiles/SequenceFiles from memory across many machines

2008-09-19 Thread Manish Katyal
Hi Chris, Have a look at Cassandra (from Facebook). [ http://code.google.com/p/the-cassandra-project/ ] It's a BigTable implementation based on Amazon Dynamo (it's completely decentralized/P2P with no single point of failure). You can import data into it very quickly (it's got async and synchronous

Do all Mapper outputs with same key go to same Reducer?

2008-09-19 Thread Stuart Sierra
Hi all, The short version of my question is in the subject. Here's the long version: I have two map/reduce jobs that output records using a common key: Job A: K1 = A1,1; K1 = A1,2; K2 = A2,1; K2 = A2,2. Job B: K1 = B1; K2 = B2; K3 = B3. And a third job that merges records with the same

Re: Serving contents of large MapFiles/SequenceFiles from memory across many machines

2008-09-19 Thread Stuart Sierra
On Thu, Sep 18, 2008 at 1:05 AM, Chris Dyer [EMAIL PROTECTED] wrote: Basically, I'd like to be able to load the entire contents of a key-value map file in DFS into memory across many machines in my cluster so that I can access any of it with ultra-low latency. I think the simplest way,

Re: Do all Mapper outputs with same key go to same Reducer?

2008-09-19 Thread Miles Osborne
So here's my question -- does Hadoop guarantee that all records with the same key will end up in the same Reducer task? If that's true, yes -- think of each record as being sent to its reduce task by hashing over the key. Miles 2008/9/19 Stuart Sierra [EMAIL PROTECTED]: Hi all, The short version of
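[Editor's note: to make the hashing point concrete, here is a sketch of what Hadoop's default HashPartitioner does in the old mapred API. Unless you plug in a custom Partitioner, every record with the same key gets the same partition number and therefore the same reduce task.]

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Mirrors the default HashPartitioner behavior described above.
    public class HashLikePartitioner<K, V> implements Partitioner<K, V> {
      public void configure(JobConf job) {}

      public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit, then take the modulus over the reducer count.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }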

Re: Serving contents of large MapFiles/SequenceFiles from memory across many machines

2008-09-19 Thread Miles Osborne
The problem here is that you don't want each mapper/reducer to have its own copy of the data; you want that data -- which can be very large -- stored in a distributed manner over your cluster, with random access to it during computation. (This is what HBase etc. do.) Miles 2008/9/19 Stuart Sierra

Re: Serving contents of large MapFiles/SequenceFiles from memory across many machines

2008-09-19 Thread Andrzej Bialecki
Miles Osborne wrote: The problem here is that you don't want each mapper/reducer to have its own copy of the data; you want that data -- which can be very large -- stored in a distributed manner over your cluster, with random access to it during computation. (This is what HBase etc. do.) I had a

Re: Serving contents of large MapFiles/SequenceFiles from memory across many machines

2008-09-19 Thread James Moore
On Wed, Sep 17, 2008 at 10:05 PM, Chris Dyer [EMAIL PROTECTED] wrote: I'm looking for a lightweight way to serve data stored as key-value pairs in a series of MapFiles or SequenceFiles. Might be worth taking a look at CouchDB as well. Haven't used it myself, so can't comment on how it might

Is pipes really supposed to be a serious API? Is it being actively developed?

2008-09-19 Thread Marc Vaillant
Only about 5 pipes/C++-related posts since mid-July, and basically no responses. Is anyone really using or actively developing pipes? We've invested some time to make it platform-independent (ported BSD sockets to Boost sockets, and the XDR serialization to Boost serialization), but it's still

Re: Is pipes really supposed to be a serious API? Is it being actively developed?

2008-09-19 Thread David Richards
I've had the same problem when wanting to integrate pipes into my system. I haven't seen serious support for or comment on pipes, so I'm seeing if I can steer clear of it. Maybe this is a wake-up call to see if we've both missed something. David On Sep 19, 2008, at 12:10 PM, Marc Vaillant

Re: Serving contents of large MapFiles/SequenceFiles from memory across many machines

2008-09-19 Thread Alex Feinberg
Do CouchDB, Cassandra, or any of the other frameworks specifically do in-memory serving? I haven't found any that do this explicitly. For now I've been using memcached for that functionality (with the usual memcached caveats). Ehcache may be another memcached-like solution

Re: Serving contents of large MapFiles/SequenceFiles from memory across many machines

2008-09-19 Thread Chris Dyer
Memcached looks like it would be a reasonable solution for my problem, although it's not optimal since it doesn't support an easy way of initializing itself at startup, but I can work around that. This may be wishful thinking, but does anyone have any experience using the Hadoop job/task

Re: Do all Mapper outputs with same key go to same Reducer?

2008-09-19 Thread Per Jacobsson
If that's true, then can I set the number of Reducers very high (even equal to the number of maps) to make Job C go faster? This page has some good info on finding the right number of reducers: http://wiki.apache.org/hadoop/HowManyMapsAndReduces / Per On Fri, Sep 19, 2008 at 9:42 AM, Miles
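[Editor's note: a hypothetical sketch of raising the reducer count with the old JobConf API. The class name and count are made up; the wiki page linked above suggests roughly 0.95 * (nodes * reduce slots per tasktracker) as a starting point.]

    import org.apache.hadoop.mapred.JobConf;

    public class MergeJobConfig {
      public static JobConf configure() {
        JobConf conf = new JobConf(MergeJobConfig.class);
        // More reducers means more of the merge runs in parallel per wave.
        conf.setNumReduceTasks(32);
        return conf;
      }
    }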

Re: no speed-up with parallel matrix calculation

2008-09-19 Thread Miles Osborne
If each mapper only sees a relatively small chunk of the data, then why not have each one compute the 2-perm counts in memory? You would then get the reducer to merge these partial results together. (Details are left to the reader ...) Miles 2008/9/19 Sandy [EMAIL PROTECTED]: Hi, I
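[Editor's note: a sketch of the in-mapper aggregation Miles is suggesting, using the old mapred API. The key/value types and the extractPairs helper are placeholders for whatever the original job actually does; each mapper tallies its chunk in a HashMap and emits only partial sums, which the reducer then merges.]

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class InMemoryCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

      private final Map<String, Long> counts = new HashMap<String, Long>();
      private OutputCollector<Text, LongWritable> out;

      public void map(LongWritable offset, Text record,
                      OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        out = output;  // remember the collector so close() can emit the totals
        for (String pair : extractPairs(record)) {
          Long c = counts.get(pair);
          counts.put(pair, c == null ? 1L : c + 1L);
        }
      }

      public void close() throws IOException {
        // One partial count per pair; the reducer sums partials across mappers.
        for (Map.Entry<String, Long> e : counts.entrySet()) {
          out.collect(new Text(e.getKey()), new LongWritable(e.getValue()));
        }
      }

      private Iterable<String> extractPairs(Text record) {
        // Placeholder for the problem-specific "2-perm" extraction.
        throw new UnsupportedOperationException("problem-specific");
      }
    }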

Re: no speed-up with parallel matrix calculation

2008-09-19 Thread Sandy
Miles, Thanks for your response. I think I understand: basically, I'm adding a combiner class that computes the partial results in phase 2, correct (just like in the word count example)? However, even if I do that, I don't think it gets rid of the overhead of reading 48 GB from disk back into
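[Editor's note: a short sketch of wiring a combiner the word-count way with the old JobConf API. The class names are placeholders; the reducer can double as the combiner only because summing partial counts is associative and commutative.]

    JobConf conf = new JobConf(CountJob.class);   // placeholder job class
    conf.setMapperClass(CountMapper.class);       // placeholder mapper
    conf.setCombinerClass(CountReducer.class);    // runs on map output before the shuffle
    conf.setReducerClass(CountReducer.class);     // merges the partial counts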

Re: no speed-up with parallel matrix calculation

2008-09-19 Thread Sandy
Thank you for the link, Edward. I'll take a look at HAMA. Does anyone know if there is a way to limit the upper bound on the number of maps being produced? I see now that mapred.tasktracker.tasks.maximum really does not limit the number of maps, as the number of maps is determined by the InputFormat. Aside from
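[Editor's note: a hedged aside, not from the thread. In the old API the number of maps is driven by the InputFormat's splits, so the usual levers are the minimum split size and the setNumMapTasks hint; the property name assumes a 0.18-era release, and the class name and values are made up.]

    JobConf conf = new JobConf(MyJob.class);         // placeholder job class
    conf.setNumMapTasks(10);                         // only a hint; the InputFormat decides
    conf.set("mapred.min.split.size", "268435456");  // 256 MB minimum split -> fewer, larger maps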