Re: Why single thread for HDFS?

2010-07-06 Thread elton sky
Steve, it seems HP has done block-based parallel reading from different datanodes. Though not at the disk level, they achieve a 4 Gb/s rate with 9 readers (~500 Mb/s each). I didn't see anywhere I can download their code to play around with, a pity. BTW, can we specify which disk to read from with Java? On Wed, …
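As far as I know, the client API gives you no way to choose the physical disk; the datanode picks the volume internally. What you can do from Java is read ranges of one file in parallel, since FSDataInputStream supports positioned reads. A rough sketch along those lines (not HP's code, which doesn't seem to be published; the path handling, chunk split, and thread count are made-up values):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    final FileSystem fs = FileSystem.get(conf);
    final Path file = new Path(args[0]);
    long len = fs.getFileStatus(file).getLen();
    int readers = 4;                              // assumed reader count
    long chunk = (len + readers - 1) / readers;   // bytes per reader
    ExecutorService pool = Executors.newFixedThreadPool(readers);
    for (int i = 0; i < readers; i++) {
      final long start = i * chunk;
      final long size = Math.min(chunk, len - start);
      if (size <= 0) break;
      pool.submit(new Runnable() {
        public void run() {
          try {
            FSDataInputStream in = fs.open(file);   // one stream per thread
            byte[] buf = new byte[64 * 1024];
            long pos = start;
            long end = start + size;
            while (pos < end) {
              int want = (int) Math.min(buf.length, end - pos);
              int got = in.read(pos, buf, 0, want); // positioned read
              if (got < 0) break;
              pos += got;                           // process buf here
            }
            in.close();
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }
}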

Re: Is org.apache.hadoop.mapred.lib.MultipleOutputFormat deprecated?

2010-07-06 Thread Ted Yu
Usually MultipleSequenceFileOutputFormat or MultipleTextOutputFormat is used. You need to call jobConf.setOutputFormat(). On Mon, Jul 5, 2010 at 1:29 AM, zhangguoping zhangguoping <zhangguopin...@gmail.com> wrote: > Hi, > > Is org.apache.hadoop.mapred.lib.MultipleOutputFormat deprecated? I did n…
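For reference, the usual pattern with the old (mapred) API is to subclass MultipleTextOutputFormat and override the file-naming hook; the class name and path scheme below are illustrative, not from the thread:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class KeyBasedOutput extends MultipleTextOutputFormat<Text, Text> {
  @Override
  protected String generateFileNameForKeyValue(Text key, Text value, String name) {
    // e.g. part-00000 becomes <key>/part-00000 under the output directory
    return key.toString() + "/" + name;
  }
}

Then in the driver: jobConf.setOutputFormat(KeyBasedOutput.class);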

Re: MapReduce HBASE examples

2010-07-06 Thread Jean-Daniel Cryans
(Moving the thread to the HBase user mailing list; on reply, please remove general@ since this is not a general question.) It is indeed a parallelizable problem that could use a job-management system, but in your case I don't think MR is the right solution. You will have to do all sorts of weird tw…

Re: Chaining Map-Reduce

2010-07-06 Thread Grant Mackey
Oh, of course it will. From what I know about using it, it just allows you to launch one job instead of several. Every time a Map/Reduce pair finishes, it will dump to HDFS (or whatever you're using). - Grant Quoting abc xyz: one option can be to use multiple chained jobs using JobContr…
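A minimal sketch of what that looks like with JobControl (old mapred API); the two JobConfs and the intermediate path are assumptions:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class TwoStageChain {
  public static void main(String[] args) throws Exception {
    JobConf conf1 = new JobConf(); // stage 1: reads the input, writes e.g. /tmp/stage1
    JobConf conf2 = new JobConf(); // stage 2: reads /tmp/stage1 back from HDFS
    Job job1 = new Job(conf1);
    Job job2 = new Job(conf2);
    job2.addDependingJob(job1);    // job2 starts only after job1 succeeds

    JobControl jc = new JobControl("two-stage-chain");
    jc.addJob(job1);
    jc.addJob(job2);

    new Thread(jc).start();        // JobControl is a Runnable
    while (!jc.allFinished()) {
      Thread.sleep(1000);
    }
    jc.stop();
  }
}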

RE: MapReduce HBASE examples

2010-07-06 Thread Kilbride, James P.
I'm assuming the rows being pulled back are fewer than the full row set of the entire database, say the 10-out-of-2B case. But each row has a column family whose 'columns' are actually rowIds in the database (basically my one-to-many relationship mapping). I'm not trying to use MR for the…

Re: MapReduce HBASE examples

2010-07-06 Thread Jean-Daniel Cryans
That won't be very efficient either... are you trying to do this for a real-time user request? If so, it really isn't the way you want to go. If you are in a batch-processing situation, I'd say it depends on how many rows you have vs. how many you need to retrieve, e.g. scanning 2B rows only to find 1…

RE: MapReduce HBASE examples

2010-07-06 Thread Kilbride, James P.
So, if that's the case, and your argument makes sense given how scan versus get works, I'd have to write a custom InputFormat class that looks like the TableInputFormat class but uses a Get (or series of Gets) rather than the Scan object the current table mapper uses? James Kilbride …
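An alternative that avoids writing an InputFormat at all, sketched here as an untested idea: keep the wanted row keys in a plain text file, use the normal text input format, and issue the Gets from the mapper. The table, family, and qualifier names are made up, and the HBaseConfiguration constructor is the 0.20-era form:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GetMapper extends Mapper<LongWritable, Text, Text, Text> {
  private HTable table;

  @Override
  protected void setup(Context context) throws IOException {
    table = new HTable(new HBaseConfiguration(context.getConfiguration()),
                       "mytable"); // hypothetical table name
  }

  @Override
  protected void map(LongWritable offset, Text rowKey, Context context)
      throws IOException, InterruptedException {
    // One Get per input line; each line holds one row key.
    Result r = table.get(new Get(Bytes.toBytes(rowKey.toString().trim())));
    if (!r.isEmpty()) {
      byte[] v = r.getValue(Bytes.toBytes("family"), Bytes.toBytes("qualifier"));
      if (v != null) {
        context.write(rowKey, new Text(v));
      }
    }
  }
}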

Re: MapReduce HBASE examples

2010-07-06 Thread Jean-Daniel Cryans
> Does this make any sense? Not in a MapReduce context; what you want to do is a LIKE with a bunch of values, right? Since a mapper will always read all the input it's given (minus some filters, like those HBase lets you push down), whatever you do will always end up being a full table scan. You…
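On the filters mentioned above: if the selection can be pushed server-side, a filtered Scan at least avoids shipping every row to the mapper, even though the region servers still walk the whole table. A sketch with assumed family, qualifier, and value names:

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FilteredScan {
  public static Scan build() {
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("family"));
    // Keep only rows whose family:status cell equals "active".
    scan.setFilter(new SingleColumnValueFilter(
        Bytes.toBytes("family"), Bytes.toBytes("status"),
        CompareOp.EQUAL, Bytes.toBytes("active")));
    return scan;
  }
}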

RE: MapReduce HBASE examples

2010-07-06 Thread Kilbride, James P.
This is an interesting start, but I'm really interested in the opposite direction, where HBase is the input to my MapReduce job and I then push some data into reducers, which I'm ultimately okay with just writing to a file. I get the impression that I need to set up a TableInputFormat…

Re: MapReduce HBASE examples

2010-07-06 Thread Harsh J
I believe this article will help you understand the new (not so new anymore) API + HBase MR: http://kdpeterson.net/blog/2009/09/minimal-hbase-mapreduce-example.html [Look at the second example, which uses the Put object.] On Tue, Jul 6, 2010 at 6:08 PM, Kilbride, James P. wrote: > All, > > The examples in…
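Along the same lines as the linked article, the driver-side wiring with the mapreduce package boils down to TableMapReduceUtil. MyTableMapper (a TableMapper<Text, Text> subclass) and MyReducer below are hypothetical placeholders, and the table name and output path are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HBaseInputDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new HBaseConfiguration(); // 0.20-era constructor
    Job job = new Job(conf, "hbase-input-example");
    job.setJarByClass(HBaseInputDriver.class);

    Scan scan = new Scan(); // full-table scan; narrow it with filters if you can
    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, MyTableMapper.class,  // hypothetical TableMapper
        Text.class, Text.class, job);

    job.setReducerClass(MyReducer.class);      // hypothetical reducer, writes files
    FileOutputFormat.setOutputPath(job, new Path("/out"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}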

Re: Why single thread for HDFS?

2010-07-06 Thread Steve Loughran
Michael Segel wrote: Uhm... That's not really true. It gets a bit more complicated than that. If you're talking about M/R jobs, you don't want to do threads in your map() routine; while this is possible, it's going to be really hard to justify the extra parallelism along with the need to wait for…

RE: Why single thread for HDFS?

2010-07-06 Thread Michael Segel
If all you want is a faster -cp option, then, knowing your initial block list and locations, you need to generate the target block list, create a thread per block, and process each block in a separate thread. You don't need to use the local disk; just read/write…
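The block-per-thread scheme described above needs coordination on the write side, since an HDFS output stream is append-only. A simpler cousin that is easy to try, copying the files of a directory in parallel with one task per file; the pool size and paths are assumptions:

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class ParallelCp {
  public static void main(String[] args) throws Exception {
    final Configuration conf = new Configuration();
    final FileSystem fs = FileSystem.get(conf);
    final Path dstDir = new Path(args[1]);
    ExecutorService pool = Executors.newFixedThreadPool(8); // assumed pool size
    for (final FileStatus stat : fs.listStatus(new Path(args[0]))) {
      if (stat.isDir()) continue;               // skip subdirectories
      pool.submit(new Callable<Void>() {
        public Void call() throws Exception {
          FileUtil.copy(fs, stat.getPath(), fs,
              new Path(dstDir, stat.getPath().getName()),
              false /* keep source */, conf);
          return null;
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }
}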

RE: Why single thread for HDFS?

2010-07-06 Thread Michael Segel
Uhm... That's not really true. It gets a bit more complicated than that. If you're talking about M/R jobs, you don't want to do threads in your map() routine; while this is possible, it's going to be really hard to justify the extra parallelism along with the need to wait for all of the threads…

MapReduce HBASE examples

2010-07-06 Thread Kilbride, James P.
All, the examples shipped with HBase, and those on the Hadoop wiki, all reference the deprecated interfaces of the mapred package. Are there any examples of how to use HBase as the input for a MapReduce job that uses the mapreduce package instead? I'm looking to set up a job which will read from a…

how to find the id of each map task?

2010-07-06 Thread Denim Live
Hello, I want to get the ID of each mapper and reducer task because I want to tag the output of these mappers and reducers according to the mapper or reducer ID. How can I retrieve the IDs? Thanks
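With the new (mapreduce) API the context hands you the attempt ID; with the old (mapred) API you read it from the JobConf in configure(). A small sketch that tags every record with the task ID:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
  private String taskId;

  @Override
  protected void setup(Context context) {
    // New API: the context exposes the attempt id directly.
    taskId = context.getTaskAttemptID().getTaskID().toString();
    // Old API equivalent, inside configure(JobConf conf):
    //   conf.get("mapred.task.id")                 full attempt id
    //   conf.getInt("mapred.task.partition", -1)   numeric task index
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(new Text(taskId), value); // tag each record with the id
  }
}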

Re: Distributed Cache

2010-07-06 Thread Amareshwari Sri Ramadasu
If they are already in the JT's FS, it does not copy them. It copies them only if they are on the local FS or some other FS. On 7/6/10 2:23 PM, "zhangguoping zhangguoping" wrote: I wonder why Hadoop wants to copy the files to the jobtracker's filesystem. Since they are already in HDFS, they should be available…

Re: Distributed Cache

2010-07-06 Thread Hemanth Yamijala
Hi, > From the book "Hadoop: The Definitive Guide", p. 242: > When you launch a job, Hadoop copies the files specified by the -files and -archives options to the jobtracker's filesystem (normally HDFS). Then, before a task is run, the tasktracker copies the files from the jobtracker's fil…

Distributed Cache

2010-07-06 Thread zhangguoping zhangguoping
From the book "Hadoop: The Definitive Guide", p. 242: > When you launch a job, Hadoop copies the files specified by the -files and -archives options to the jobtracker's filesystem (normally HDFS). Then, before a task is run, the tasktracker copies the files from the jobtracker's filesystem to a lo…
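For reference, the programmatic form of -files, using an HDFS path so that (per the replies above) no extra copy to the jobtracker's filesystem should be needed; the path is an assumption:

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheExample {
  // Driver side: register a file that already lives in HDFS.
  public static void register(JobConf conf) throws Exception {
    DistributedCache.addCacheFile(new URI("/lookup/terms.txt"), conf);
  }

  // Task side (e.g. in configure()): resolve the tasktracker-local copies.
  public static Path[] localCopies(JobConf conf) throws IOException {
    return DistributedCache.getLocalCacheFiles(conf);
  }
}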

Re: Partitioned Datasets Map/Reduce

2010-07-06 Thread abc xyz
Well, I want to do some experimentation with Hadoop. I need to partition two datasets using the same partitioning function and then, in the next job, take the same partition from both datasets and apply some operation in the mapper. But how do I ensure I get the same partition from both sources in o…
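One way to keep the partitions aligned, sketched under the assumption that both jobs can share code: give them the same Partitioner class and the same number of reduce tasks, so a given key lands in the same-numbered partition in both outputs:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class SharedHashPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    // Identical function in both jobs => identical key-to-partition mapping.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

In both drivers, set job.setPartitionerClass(SharedHashPartitioner.class) and job.setNumReduceTasks(N) with the same N; part-r-0000i of one output then pairs with part-r-0000i of the other.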

Re: Why single thread for HDFS?

2010-07-06 Thread Gautam Singaraju
To add to Jay Booth's points, adding multi-threaded capability to HDFS would bring down performance. Consider a production setup where 4-5 jobs are running on a low-end commodity server; currently, that is 4-5 threads reading from and writing to the hard disk. Making it a multi-threaded read and write…