Bobby,
Thanks for such a thoughtful response.
I have a data set that represents all the people that pass through Las Vegas
over a period of time, say five years, which comes to about 175-200
million people. Each record is a person, and it contains fields for where
they came from, left to; tim
Geoffry,
That really depends on how much data you are processing and the algorithm you
need to use to process it. I did something similar a while ago with a
medium amount of data, and we saw a significant speed-up by first assigning each
record a new key based on the expected range of
All,
I am mostly seeking confirmation as to my thinking on this matter.
I have an MR job that I believe will force me into using a single reducer.
The nature of the process is one where calculations performed on a given
record rely on certain accumulated values whose calculation depends on
rollin
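For what it's worth, the standard workaround for running accumulations is a two-pass scheme: range-partition the records, compute each partition's subtotal in parallel, then turn the subtotals into per-partition starting offsets so every partition can finish its running values independently. A minimal sketch in plain Python (outside Hadoop, with a running sum standing in for whatever the accumulated values actually are):

```python
from itertools import accumulate

def parallel_running_sum(partitions):
    """Compute a global running sum over ordered partitions in two passes.

    Pass 1 (parallelizable): each partition reports its local total.
    Pass 2 (parallelizable): each partition adds the offset contributed
    by all earlier partitions to its own local running sums.
    """
    # Pass 1: per-partition totals (each could run in its own mapper).
    totals = [sum(p) for p in partitions]

    # Starting offset for each partition = sum of all earlier totals.
    offsets = [0] + list(accumulate(totals))[:-1]

    # Pass 2: local running sums shifted by the partition's offset.
    return [
        [off + s for s in accumulate(p)]
        for off, p in zip(offsets, partitions)
    ]

# Records range-partitioned into three ordered chunks.
parts = [[1, 2, 3], [4, 5], [6]]
print(parallel_running_sum(parts))  # [[1, 3, 6], [10, 15], [21]]
```

This only works when the accumulation is associative enough to be split this way (sums, counts, and the like); if each record's calculation depends on arbitrary earlier state, a single reducer may indeed be unavoidable.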
Even for a single machine (and there may be reasons to use a single machine
if the original data is not splittable), our experience suggests it should
take about an hour to process 32 GB on a single machine, leading me to wonder
whether writing the SequenceFile is your limiting step. Consider very
Oops, sorry, I answered in the wrong thread. I intended to reply to the "How to
create a SequenceFile faster" issue.
Regards,
Christoph
-Original Message-
From: 丛林 [mailto:congli...@gmail.com]
Sent: Thursday, 12 May 2011 14:30
To: mapreduce-user@hadoop.apache.org
Subject: R
Hi Christoph,
If there is no reducer, how can these sequence files be merged?
Thanks for your advice.
Best Wishes,
-Lin
On 12 May 2011 at 19:44, Christoph Schmitz wrote:
> Hi Lin,
>
> you could run a map-only job, i.e. read your data and output it from the
> mapper without any reducer at all (set map
Hi Lin,
you could run a map-only job, i.e. read your data and output it from the mapper
without any reducer at all (set mapred.reduce.tasks=0 or, equivalently, use
job.setNumReduceTasks(0)).
That way, you parallelize over your inputs through a number of mappers and do
not have any sort/shuffle
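In Hadoop Streaming terms (a hypothetical stand-in for Lin's actual job), a map-only job is nothing more than a mapper script: with mapred.reduce.tasks=0, whatever the mapper emits is written straight to the output files, with no sort or shuffle in between. A minimal identity-style mapper sketch:

```python
import sys

def run_mapper(lines, out=sys.stdout):
    """Identity mapper: emit each non-empty input record unchanged.
    With mapred.reduce.tasks=0, these lines go directly to the
    job's output files (no sort, no shuffle)."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Pass the record through untouched; a real job would
        # transform or re-key it here.
        out.write(line + "\n")

if __name__ == "__main__":
    run_mapper(sys.stdin)
```

Each mapper writes its own part-* file, so the parallelism comes entirely from how the input is split.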
Dear Jason,
If the order of the keys in the sequence file does not matter to me (in
other words, if the sort is unnecessary), how can I skip the
distributed sort and save those resources?
Thanks for your suggestion.
Best Wishes,
-Lin
2011/5/12 jason :
> M/R job with a single redu
Dear Harsh,
Could you please explain how to create a sequence file with MapReduce?
Suppose that all of the small files (32 GB in total) are stored on one PC.
Thanks for your suggestion.
BTW: I notice that you have answered most of the sequence file topics on
this mailing list :-)
Best Wishes,
-Lin
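The real SequenceFile format is written with Hadoop's SequenceFile.Writer, which is not reproduced here; but the core idea, packing many small files into one container of length-prefixed key/value records, can be sketched in plain Python (a simplified illustration of the packing idea, not the actual on-disk format):

```python
import struct

def pack_records(records):
    """Pack (key, value) byte pairs into one container blob, each
    field prefixed with its 4-byte big-endian length (a simplified
    analogue of a SequenceFile, not the real format)."""
    out = bytearray()
    for key, value in records:
        out += struct.pack(">I", len(key)) + key
        out += struct.pack(">I", len(value)) + value
    return bytes(out)

def unpack_records(blob):
    """Inverse of pack_records: yield the (key, value) pairs back."""
    pos = 0
    while pos < len(blob):
        klen = struct.unpack_from(">I", blob, pos)[0]
        key = blob[pos + 4:pos + 4 + klen]
        pos += 4 + klen
        vlen = struct.unpack_from(">I", blob, pos)[0]
        value = blob[pos + 4:pos + 4 + vlen]
        pos += 4 + vlen
        yield key, value

# Two small "files" packed into one container, keyed by filename.
files = [(b"a.txt", b"hello"), (b"b.txt", b"world")]
assert list(unpack_records(pack_records(files))) == files
```

In the MapReduce version of this, each mapper (or a single local writer) would emit (filename, contents) pairs into the container instead of building it in memory.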
2011/5/12 Hars