Thanks Bryan. This is great stuff!
On Thu, Sep 12, 2013 at 8:49 PM, Bryan Beaudreault bbeaudrea...@hubspot.com
wrote:
Hey Adrian,
To clarify, the replication happens on *write*. So as you write output
from the reducer of Job A, you are writing into HDFS. Part of that write
path is …
I've just seen your email, Vinod. This is the behaviour I'd expect, and it is
similar to other data integration tools; I will keep an eye out for it as a
long-term option.
On Fri, Sep 13, 2013 at 5:26 AM, Vinod Kumar Vavilapalli vino...@apache.org
wrote:
Other than the short-term solutions …
Hey Adrian,
To clarify, the replication happens on *write*. So as you write output
from the reducer of Job A, you are writing into HDFS. Part of that write
path is replicating the data to 2 additional hosts in the cluster (local +
2; this is controlled by the dfs.replication configuration value).
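As a sketch, the replication factor Bryan mentions is set cluster-wide in hdfs-site.xml (the value 3 below is the usual default — "local + 2"; a per-file factor can also be passed when the file is created):

```xml
<!-- hdfs-site.xml: each block written to HDFS is stored on this many DataNodes -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```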
Other than the short-term solutions that others have proposed, Apache Tez
solves this exact problem. It can run M-M-R-R-R chains, and multi-way mappers
and reducers, and your own custom processors - all without persisting the
intermediate outputs to HDFS.
It works on top of YARN, though the first …
Howdy,
My application requires 2 distinct processing steps (reducers) to be
performed on the input data. The first operation changes the key values,
and records that had different keys in step 1 can end up having the same
key in step 2.
The heavy lifting of the operation is in step 1 …
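A toy illustration of the re-keying described above, in plain Java with no Hadoop dependencies (the record shapes and the region lookup are hypothetical, just to show how records with distinct step-1 keys can collapse onto the same step-2 key):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class RekeyDemo {
    public static void main(String[] args) {
        // Step-1 output: (key, value) pairs keyed by, say, a user id.
        List<Map.Entry<String, Integer>> step1 = List.of(
            Map.entry("user-1", 10),
            Map.entry("user-2", 5),
            Map.entry("user-3", 7));

        // Step 2 re-keys each record (here via a hypothetical region lookup),
        // so records with different step-1 keys may share a step-2 key.
        Map<String, String> region = Map.of(
            "user-1", "eu", "user-2", "us", "user-3", "eu");

        // Group by the new key and reduce (sum) the values per group.
        Map<String, Integer> step2 = step1.stream()
            .collect(Collectors.groupingBy(
                e -> region.get(e.getKey()),                   // new key
                TreeMap::new,                                  // stable ordering
                Collectors.summingInt(Map.Entry::getValue)));  // reduce

        System.out.println(step2); // prints {eu=17, us=5}
    }
}
```

In MapReduce terms, the second grouping is exactly why a second shuffle (and hence a second job, or a framework that chains reducers) is needed.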
If you want to stay in Java, look at Cascading. Pig is also helpful. I think
there are others (Spring Integration, maybe?) but I'm not familiar enough with
them to make a recommendation.
Note that with Cascading and Pig you don't write 'map reduce'; you write
logic and they map it to the various …
Thank you, Chris. I will look at Cascading and Pig, but for starters I'd
prefer, if possible, to keep everything as close to the Hadoop libraries as
I can.
I am sure I am overlooking something basic, as repartitioning is a fairly
common operation in MPP environments.
On Thu, Sep 12, 2013 at 2:39 PM, …
The temporary-file solution will work in a single-node configuration, but
I'm not sure about an MPP config.
Let's say Job A runs on nodes 0 and 1 and Job B runs on nodes 2 and 3, or
both jobs run on all 4 nodes - will HDFS be able to redistribute the
records between nodes automagically, or does …
It really comes down to the following:
In Job A set mapred.output.dir to some directory X.
In Job B set mapred.input.dir to the same directory X.
For Job A, do context.write() as normal, and each reducer will create an
output file in mapred.output.dir. Then in Job B, each of those will …
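Bryan's recipe can be sketched as a single driver class (assuming the new org.apache.hadoop.mapreduce API; the intermediate path, job names, and the elided mapper/reducer setup are placeholders — this needs a Hadoop installation to actually run):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path intermediate = new Path("/tmp/jobA-out"); // "directory X"

        Job jobA = Job.getInstance(conf, "job A");
        jobA.setJarByClass(ChainDriver.class);
        // ...set Job A's mapper, reducer, and output key/value classes here...
        FileInputFormat.addInputPath(jobA, new Path(args[0]));
        FileOutputFormat.setOutputPath(jobA, intermediate);   // mapred.output.dir
        if (!jobA.waitForCompletion(true)) System.exit(1);    // run A to completion

        Job jobB = Job.getInstance(conf, "job B");
        jobB.setJarByClass(ChainDriver.class);
        // ...set Job B's mapper, reducer, and output key/value classes here...
        FileInputFormat.addInputPath(jobB, intermediate);     // mapred.input.dir
        FileOutputFormat.setOutputPath(jobB, new Path(args[1]));
        System.exit(jobB.waitForCompletion(true) ? 0 : 1);
    }
}
```

Job B's InputFormat will pick up each part-r-* file that Job A's reducers wrote into the intermediate directory.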
Thanks Bryan.
Yes, I am using Hadoop + HDFS.
If I understand your point, Hadoop tries to start the mapping processes on
nodes where the data is local, and if that's not possible, then it is HDFS
that replicates the data to the mapper nodes?
I expected to have to set this up in the code, and I …
Cascading would be a good option in case you have a complex flow. However, in
your case you are trying to chain two jobs only. I would suggest you follow
these steps.
1. The output directory of Job1 would be set as the input directory for Job2.
2. Launch Job1 using the new API. In the launcher …
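For sequencing the two launches, the new API also ships a JobControl helper (org.apache.hadoop.mapreduce.lib.jobcontrol) that starts Job2 only after Job1 succeeds. A minimal sketch, with the per-job mapper/reducer setup elided (this is one plausible reading of step 2, and it needs a Hadoop installation to run):

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class Launcher {
    public static void run(Job job1, Job job2) throws Exception {
        ControlledJob c1 = new ControlledJob(job1.getConfiguration());
        c1.setJob(job1);
        ControlledJob c2 = new ControlledJob(job2.getConfiguration());
        c2.setJob(job2);
        c2.addDependingJob(c1); // Job2 runs only after Job1 succeeds

        JobControl control = new JobControl("two-step-chain");
        control.addJob(c1);
        control.addJob(c2);

        Thread runner = new Thread(control); // JobControl implements Runnable
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(1000); // poll until both jobs are done
        }
        control.stop();
    }
}
```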