Steve,
It seems HP has done block-based parallel reading from different datanodes.
Though not at the disk level, they achieve a 4Gb/s rate with 9 readers (500Mb/s
each).
I didn't see anywhere I could download their code to play around with, which is a pity~
BTW, can we specify which disk to read from with Java?
On Wed,
Usually MultipleSequenceFileOutputFormat or MultipleTextOutputFormat is
used.
You need to use:
jobConf.setOutputFormat()
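For illustration, a minimal sketch (class name and output path are hypothetical, not from the original mail) of wiring MultipleTextOutputFormat in through jobConf.setOutputFormat():

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class MultiOutputSetup {

    // Routes each record to an output file named after its key.
    public static class KeyBasedOutput extends MultipleTextOutputFormat<Text, Text> {
        @Override
        protected String generateFileNameForKeyValue(Text key, Text value, String name) {
            return key.toString() + "/" + name;
        }
    }

    public static void configure(JobConf jobConf) {
        jobConf.setOutputKeyClass(Text.class);
        jobConf.setOutputValueClass(Text.class);
        // The call mentioned above: register the multiple-output format.
        jobConf.setOutputFormat(KeyBasedOutput.class);
        FileOutputFormat.setOutputPath(jobConf, new Path("/tmp/multi-out"));
    }
}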
On Mon, Jul 5, 2010 at 1:29 AM, zhangguoping zhangguoping <
zhangguopin...@gmail.com> wrote:
> Hi,
>
> Is org.apache.hadoop.mapred.lib.MultipleOutputFormat deprecated? I did n
(moving the thread to the HBase user mailing list; on reply please remove
the general@ address since this is not a general question)
It is indeed a parallelizable problem that could use a job management
system, but in your case I don't think MR is the right solution. You will
have to do all sorts of weird tw
Oh, of course it will. From what I know about using it, it just
allows you to launch one job instead of several. Every time a map/reduce
pair finishes, it will dump its output to HDFS (or whatever you're using).
- Grant
Quoting abc xyz :
one option can be to use multiple chained jobs using JobControl
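For reference, a minimal sketch of chaining two jobs with the old mapred JobControl API (the two JobConfs are assumed to be set up elsewhere):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class ChainedJobs {
    public static void run(JobConf firstConf, JobConf secondConf) throws Exception {
        Job first = new Job(firstConf);
        Job second = new Job(secondConf);
        second.addDependingJob(first);   // second starts only after first succeeds

        JobControl control = new JobControl("chained-jobs");
        control.addJob(first);
        control.addJob(second);

        // JobControl implements Runnable; drive it from a thread and poll.
        Thread runner = new Thread(control);
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(1000);
        }
        control.stop();
    }
}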
I'm assuming the rows being pulled back are smaller than the full row set of
the entire database. So say the 10 out of 2B case. But each row has a column
family whose 'columns' are actually rowIds in the database (basically my one-
to-many relationship mapping). I'm not trying to use MR for the
That won't be very efficient either... are you trying to do this for a real-
time user request? If so, it really isn't the way you want to go.
If you are in a batch processing situation, I'd say it depends on how many
rows you have vs. how many you need to retrieve, e.g. scanning 2B rows only to
find 1
So, if that's the case, and your argument makes sense given how scan
versus get works, I'd have to write a custom InputFormat class that looks like
the TableInputFormat class, but uses a Get (or series of Gets) rather than a
Scan object as the current table mapper does?
James Kilbride
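In case it helps, a minimal sketch (table name, column family, and row keys are hypothetical) of what the Get side boils down to; a Get-based input format would essentially issue lookups like these per split instead of scanning everything:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GetLookups {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");
        for (String rowKey : new String[] {"row-00001", "row-00002", "row-00003"}) {
            Get get = new Get(Bytes.toBytes(rowKey));
            get.addFamily(Bytes.toBytes("links"));   // the "one to many" column family
            Result result = table.get(get);
            if (!result.isEmpty()) {
                System.out.println(Bytes.toString(result.getRow()));
            }
        }
        table.close();
    }
}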
> Does this make any sense?
Not in a MapReduce context. What you want to do is a LIKE with a bunch of
values, right? Since a mapper will always read all the input that it's given
(minus some filters, like you can do with HBase), whatever you do will always
end up being a full table scan. You
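If the goal is just to cut down what comes back to the mappers (the scan itself still touches every row server-side), something along these lines is possible; a rough sketch with hypothetical match patterns:

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.filter.SubstringComparator;

public class LikeScanExample {
    public static Scan buildScan(String... patterns) {
        // OR the patterns together: a row passes if any substring matches its key.
        FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ONE);
        for (String pattern : patterns) {
            filters.addFilter(new RowFilter(
                    CompareFilter.CompareOp.EQUAL, new SubstringComparator(pattern)));
        }
        Scan scan = new Scan();
        scan.setFilter(filters);
        return scan;
    }
}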
This is an interesting start, but I'm really interested in the opposite
direction, where HBase is the input to my MapReduce job, and then I'm going to
push some data into reducers, which ultimately I'm okay with just writing
to a file.
I get the impression that I need to set up a TableIn
I believe this article will help you understand the (no longer so new)
API + HBase MR:
http://kdpeterson.net/blog/2009/09/minimal-hbase-mapreduce-example.html
[Look at the second example, which uses the Put object]
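For the input direction asked about above, a minimal sketch (table name, mapper logic, and output path are hypothetical) of a job that reads from HBase with the newer org.apache.hadoop.hbase.mapreduce package:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class HBaseInputJob {
    // The mapper receives one HBase row (Result) per call.
    static class RowKeyMapper extends TableMapper<Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(Bytes.toString(key.get())), ONE);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase-input-example");
        job.setJarByClass(HBaseInputJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);          // reasonable batching for MR scans
        scan.setCacheBlocks(false);    // don't pollute the block cache

        TableMapReduceUtil.initTableMapperJob(
                "mytable", scan, RowKeyMapper.class,
                Text.class, IntWritable.class, job);

        job.setNumReduceTasks(0);      // map-only here; reducers can be added as needed
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/hbase-input-out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}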
On Tue, Jul 6, 2010 at 6:08 PM, Kilbride, James P.
wrote:
> All,
>
> The examples in
Michael Segel wrote:
Uhm...
That's not really true. It gets a bit more complicated than that.
If you're talking about M/R jobs, you don't want to do threads in your map() routine; while this is possible, it's going to be really hard to justify the extra parallelism along with the need to wait for all of the threads
If all you want is a faster -cp option, then if you know your
initial block list and location, you need to generate the target block list and
then create a single thread per block and process each block in a separate
thread.
You don't need to use the local disk and just read/write
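For what it's worth, a rough sketch of that thread-per-block idea, assuming each block's byte range is copied into its own part file (paths are illustrative):

import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelBlockCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path src = new Path("/data/big.file");          // hypothetical source
        Path dstDir = new Path("/data/big.file.parts"); // hypothetical destination dir

        FileStatus status = fs.getFileStatus(src);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        // One thread per block, as described above.
        ExecutorService pool = Executors.newFixedThreadPool(blocks.length);
        for (int i = 0; i < blocks.length; i++) {
            final long offset = blocks[i].getOffset();
            final long length = blocks[i].getLength();
            final Path part = new Path(dstDir, "part-" + i);
            pool.submit(() -> {
                try (FSDataInputStream in = fs.open(src);
                     FSDataOutputStream out = fs.create(part)) {
                    in.seek(offset);                     // jump to this block's range
                    byte[] buf = new byte[64 * 1024];
                    long remaining = length;
                    while (remaining > 0) {
                        int read = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                        if (read < 0) break;
                        out.write(buf, 0, read);
                        remaining -= read;
                    }
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
        }
        pool.shutdown();
    }
}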
All,
The examples shipped with HBase, and those on the Hadoop wiki, all reference the
deprecated interfaces of the mapred package. Are there any examples of how to
use HBase as the input for a MapReduce job that uses the mapreduce package
instead? I'm looking to set up a job which will read from a
Hello,
I want to get the ID of each mapper and reducer task because I want to tag the
output of these mappers and reducers according to the mapper or reducer ID.
How can I retrieve the ID of each?
Thanks
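In case it helps, a minimal sketch (new mapreduce API assumed) of tagging records with the task ID from inside a mapper; a reducer can do the same thing in its setup():

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String taskTag;

    @Override
    protected void setup(Context context) {
        // e.g. "task_201007061723_0001_m_000003"; getTaskID() drops the attempt suffix.
        taskTag = context.getTaskAttemptID().getTaskID().toString();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(new Text(taskTag), value);   // prefix every record with the task ID
    }
}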
If they are already in the JobTracker's filesystem, it does not copy them. It
copies them only if they are on the local FS or some other FS.
On 7/6/10 2:23 PM, "zhangguoping zhangguoping" wrote:
I wonder why Hadoop wants to copy the files to the jobtracker's filesystem.
Since it is already in HDFS, it should be available
From the book "Hadoop: The Definitive Guide", p. 242:
When you launch a job, Hadoop copies the files specified by the -files and
-archives options to the jobtracker's filesystem (normally HDFS). Then,
before a task is run, the tasktracker copies the files from the jobtracker's
filesystem to a local disk
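For what it's worth, a minimal sketch (file and class names are hypothetical) of the mechanism the book describes: run the driver through ToolRunner so -files is parsed, and the task then opens the localized copy by its plain name in its working directory:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class CachedFileJob extends Configured implements Tool {

    public static class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws IOException {
            // "lookup.txt" was shipped with: hadoop jar job.jar CachedFileJob -files lookup.txt ...
            // The tasktracker localizes it into the task's working directory.
            try (BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // ... load the side data ...
                }
            }
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        // Job setup omitted; the point is that ToolRunner strips -files/-archives
        // before the remaining args reach run().
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new CachedFileJob(), args));
    }
}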
Well, I want to do some experimentation with Hadoop. I need to partition two
datasets using the same partitioning function and then, in the next job, take the
same partition from both datasets and apply some operation in the mapper. But
how do I ensure that I get the same partition from both sources in o
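A rough sketch of the usual way to get that guarantee, assuming both jobs run with the same number of reducers: use the same Partitioner in both, so a given key lands in the same-numbered partition of each output (the class below just mirrors HashPartitioner's formula; names are illustrative):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class SharedHashPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Deterministic for a given key, so both datasets route the key
        // to the same partition index (e.g. part-r-00007 of each output).
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}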
To add to Jay Booth's points, adding multi-threaded capability to HDFS
will bring down performance. Consider a low-end commodity production server
where 4-5 jobs are running. Currently, that is
4-5 threads reading and writing from the hard disk. Making it a
multi-threaded read and wri