Re: chaining (the output of) jobs/ reducers

2013-09-17 Thread Adrian CAPDEFIER
Thanks Bryan. This is great stuff!


On Thu, Sep 12, 2013 at 8:49 PM, Bryan Beaudreault bbeaudrea...@hubspot.com
 wrote:

 Hey Adrian,

 To clarify, the replication happens on *write*.  So as you write output
 from the reducer of Job A, you are writing into hdfs.  Part of that write
 path is replicating the data to 2 additional hosts in the cluster (local +
 2; this is configured by the dfs.replication configuration value).  So by the
 time Job B starts, hadoop has 3 options for where each mapper can run and be
 data-local.  Hadoop will do all the work to try to make everything as local
 as possible.

 You'll be able to see from the counters on the job how successful hadoop
 was at placing your mappers.  See the counters Data-local map tasks and
 Rack-local map tasks.  Rack-local being those where hadoop was not able
 to place the mapper on the same host as the data, but was at least able to
 keep it within the same rack.

 All of this is dependent on a proper topology configuration, in both your
 NameNode and JobTracker.
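
 As a rough sketch of checking this (assuming the org.apache.hadoop.mapreduce
 API; the exact counter enums and group names differ between Hadoop versions),
 you can iterate the job's counters from the driver once it has finished and
 print the ones whose display names match the two counters above:

 import org.apache.hadoop.mapreduce.Counter;
 import org.apache.hadoop.mapreduce.CounterGroup;
 import org.apache.hadoop.mapreduce.Job;

 public class LocalityCounters {
     // Call after job.waitForCompletion(true) has returned.
     public static void printLocality(Job job) throws Exception {
         for (CounterGroup group : job.getCounters()) {
             for (Counter counter : group) {
                 String name = counter.getDisplayName();
                 // Matches "Data-local map tasks" and "Rack-local map tasks".
                 if (name.contains("local map tasks")) {
                     System.out.println(name + " = " + counter.getValue());
                 }
             }
         }
     }
 }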


 On Thu, Sep 12, 2013 at 3:02 PM, Adrian CAPDEFIER 
 chivas314...@gmail.com wrote:

 Thanks Bryan.

 Yes, I am using hadoop + hdfs.

 If I understand your point, hadoop tries to start the mapping processes
 on nodes where the data is local and if that's not possible, then it is
 hdfs that replicates the data to the mapper nodes?

 I expected to have to set this up in the code and I completely ignored
 HDFS; I guess it's a case of not seeing the forest for the trees!



  On Thu, Sep 12, 2013 at 6:38 PM, Bryan Beaudreault 
 bbeaudrea...@hubspot.com wrote:

 It really comes down to the following:

 In Job A set mapred.output.dir to some directory X.
 In Job B set mapred.input.dir to the same directory X.

 For Job A, do context.write() as normal, and each reducer will create
 an output file in mapred.output.dir.  Then in Job B each of those files
 will correspond to a mapper.

 You also need to make sure your input and output formats, as well
 as input and output keys/values, match up between the two jobs.

 If you are using HDFS, which it seems you are, the directories specified
 can be HDFS directories.  In that case, with a replication factor of 3,
 each of these output files will exist on 3 nodes.  Hadoop and HDFS will do
 the work to ensure that the mappers in the second job are data- or
 rack-local wherever possible.
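
 As a minimal driver sketch of this setup (the paths, job names, and
 Text/SequenceFile types below are assumptions; your real step-1/step-2
 mapper and reducer classes and formats go where the comments indicate):
 Job A writes its reducer output into an intermediate HDFS directory and
 Job B reads that same directory as its input.

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

 public class ChainDriver {
     public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         Path input = new Path(args[0]);        // original input (assumed a SequenceFile of Text/Text)
         Path intermediate = new Path(args[1]); // the shared directory "X"
         Path output = new Path(args[2]);       // final output

         // Job A: the heavy lifting; its reducers write part files into "X".
         Job jobA = new Job(conf, "step-1");
         jobA.setJarByClass(ChainDriver.class);
         // setMapperClass/setReducerClass would point at your step-1 classes;
         // the identity defaults are used here only to keep the sketch self-contained.
         jobA.setInputFormatClass(SequenceFileInputFormat.class);
         jobA.setOutputFormatClass(SequenceFileOutputFormat.class);
         jobA.setOutputKeyClass(Text.class);
         jobA.setOutputValueClass(Text.class);
         FileInputFormat.addInputPath(jobA, input);
         FileOutputFormat.setOutputPath(jobA, intermediate);
         if (!jobA.waitForCompletion(true)) {
             System.exit(1);
         }

         // Job B: reads the files Job A produced; one map task per split,
         // scheduled data- or rack-local where possible.
         Job jobB = new Job(conf, "step-2");
         jobB.setJarByClass(ChainDriver.class);
         // setMapperClass/setReducerClass would point at your step-2 classes.
         jobB.setInputFormatClass(SequenceFileInputFormat.class);
         jobB.setOutputKeyClass(Text.class);
         jobB.setOutputValueClass(Text.class);
         FileInputFormat.addInputPath(jobB, intermediate);
         FileOutputFormat.setOutputPath(jobB, output);
         System.exit(jobB.waitForCompletion(true) ? 0 : 1);
     }
 }

 Using SequenceFiles as the hand-off format keeps Job B's input format and
 key/value classes lined up with what Job A's reducers wrote, which is the
 matching described above.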


 On Thu, Sep 12, 2013 at 12:35 PM, Adrian CAPDEFIER 
 chivas314...@gmail.com wrote:

 Thank you, Chris. I will look at Cascading and Pig, but for starters
 I'd prefer to keep everything as close to the hadoop libraries as
 possible.

 I am sure I am overlooking something basic, as repartitioning is a
 fairly common operation in MPP environments.


 On Thu, Sep 12, 2013 at 2:39 PM, Chris Curtin 
 curtin.ch...@gmail.com wrote:

 If you want to stay in Java look at Cascading. Pig is also helpful. I
 think there are others (Spring Integration, maybe?) but I'm not familiar
 with them enough to make a recommendation.

 Note that with Cascading and Pig you don't write 'map reduce'; you
 write logic and they map it to the various mapper/reducer steps
 automatically.

 Hope this helps,

 Chris


 On Thu, Sep 12, 2013 at 9:36 AM, Adrian CAPDEFIER 
 chivas314...@gmail.com wrote:

 Howdy,

 My application requires 2 distinct processing steps (reducers) to be
 performed on the input data. The first operation changes the key values,
 and records that had different keys in step 1 can end up having the same
 key in step 2.

 The heavy lifting of the operation is in step 1; step 2 only combines
 records whose keys were changed.

 In short, the overview is:
 Sequential file -> Step 1 -> Step 2 -> Output.


 To implement this in hadoop, it seems that I need to create a
 separate job for each step.

 Now I assumed there would be some sort of job management under hadoop
 to link Jobs 1 and 2, but the only thing I could find was related to job
 scheduling and nothing on how to synchronize the input/output of the
 linked jobs.



 The only crude solution that I can think of is to use a temporary
 file under HDFS, but even so I'm not sure if this will work.

 The overview of the process would be:
 Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer
 (key2, value2)] => Temporary file => Job B[Mapper (key2, value2) =>
 Reducer (key2, value3)] => output.

 Is there a better way to pass the output from Job A as input to Job B
 (e.g. using network streams or some built-in Java classes that don't do
 disk I/O)?



 The temporary file solution will work in a single-node configuration,
 but I'm not sure about an MPP config.

 Let's say Job A runs on nodes 0 and 1 and Job B runs on nodes 2 and 3,
 or both jobs run on all 4 nodes - will HDFS be able to automagically
 redistribute the records between nodes or does this need to be coded
 somehow?









Re: chaining (the output of) jobs/ reducers

2013-09-17 Thread Adrian CAPDEFIER
I've just seen your email, Vinod. This is the behaviour that I'd expect,
and it is similar to other data integration tools; I will keep an eye out
for it as a long-term option.


On Fri, Sep 13, 2013 at 5:26 AM, Vinod Kumar Vavilapalli vino...@apache.org
 wrote:


 Other than the short-term solutions that others have proposed, Apache Tez
 solves this exact problem. It can run M-M-R-R-R chains, multi-way mappers
 and reducers, and your own custom processors - all without persisting the
 intermediate outputs to HDFS.

 It works on top of YARN, though the first release of Tez is yet to happen.

 You can learn about it more here: http://tez.incubator.apache.org/

 HTH,
 +Vinod

 On Sep 12, 2013, at 6:36 AM, Adrian CAPDEFIER wrote:

 Howdy,

 My application requires 2 distinct processing steps (reducers) to be
 performed on the input data. The first operation changes the key values,
 and records that had different keys in step 1 can end up having the same
 key in step 2.

 The heavy lifting of the operation is in step 1; step 2 only combines
 records whose keys were changed.

 In short, the overview is:
 Sequential file -> Step 1 -> Step 2 -> Output.


 To implement this in hadoop, it seems that I need to create a separate job
 for each step.

 Now I assumed there would be some sort of job management under hadoop to
 link Jobs 1 and 2, but the only thing I could find was related to job
 scheduling and nothing on how to synchronize the input/output of the
 linked jobs.



 The only crude solution that I can think of is to use a temporary file
 under HDFS, but even so I'm not sure if this will work.

 The overview of the process would be:
 Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer
 (key2, value2)] => Temporary file => Job B[Mapper (key2, value2) =>
 Reducer (key2, value3)] => output.

 Is there a better way to pass the output from Job A as input to Job B
 (e.g. using network streams or some built-in Java classes that don't do
 disk I/O)?



 The temporary file solution will work in a single-node configuration, but
 I'm not sure about an MPP config.

 Let's say Job A runs on nodes 0 and 1 and Job B runs on nodes 2 and 3,
 or both jobs run on all 4 nodes - will HDFS be able to automagically
 redistribute the records between nodes or does this need to be coded
 somehow?





chaining (the output of) jobs/ reducers

2013-09-12 Thread Adrian CAPDEFIER
Howdy,

My application requires 2 distinct processing steps (reducers) to be
performed on the input data. The first operation changes the key values,
and records that had different keys in step 1 can end up having the same
key in step 2.

The heavy lifting of the operation is in step 1; step 2 only combines
records whose keys were changed.

In short, the overview is:
Sequential file -> Step 1 -> Step 2 -> Output.


To implement this in hadoop, it seems that I need to create a separate job
for each step.

Now I assumed there would be some sort of job management under hadoop to
link Jobs 1 and 2, but the only thing I could find was related to job
scheduling and nothing on how to synchronize the input/output of the
linked jobs.



The only crude solution that I can think of is to use a temporary file
under HDFS, but even so I'm not sure if this will work.

The overview of the process would be:
Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer
(key2, value2)] => Temporary file => Job B[Mapper (key2, value2) => Reducer
(key2, value3)] => output.

Is there a better way to pass the output from Job A as input to Job B (e.g.
using network streams or some built-in Java classes that don't do disk
I/O)?



The temporary file solution will work in a single-node configuration, but
I'm not sure about an MPP config.

Let's say Job A runs on nodes 0 and 1 and Job B runs on nodes 2 and 3, or
both jobs run on all 4 nodes - will HDFS be able to automagically
redistribute the records between nodes or does this need to be coded
somehow?


Re: chaining (the output of) jobs/ reducers

2013-09-12 Thread Adrian CAPDEFIER
Thank you, Chris. I will look at Cascading and Pig, but for starters I'd
prefer to keep everything as close to the hadoop libraries as possible.

I am sure I am overlooking something basic, as repartitioning is a fairly
common operation in MPP environments.


On Thu, Sep 12, 2013 at 2:39 PM, Chris Curtin curtin.ch...@gmail.com wrote:

 If you want to stay in Java look at Cascading. Pig is also helpful. I
 think there are others (Spring Integration, maybe?) but I'm not familiar
 with them enough to make a recommendation.

 Note that with Cascading and Pig you don't write 'map reduce'; you write
 logic and they map it to the various mapper/reducer steps automatically.

 Hope this helps,

 Chris


 On Thu, Sep 12, 2013 at 9:36 AM, Adrian CAPDEFIER 
 chivas314...@gmail.com wrote:

 Howdy,

 My application requires 2 distinct processing steps (reducers) to be
 performed on the input data. The first operation changes the key values,
 and records that had different keys in step 1 can end up having the same
 key in step 2.

 The heavy lifting of the operation is in step 1; step 2 only combines
 records whose keys were changed.

 In short, the overview is:
 Sequential file -> Step 1 -> Step 2 -> Output.


 To implement this in hadoop, it seems that I need to create a separate
 job for each step.

 Now I assumed there would be some sort of job management under hadoop to
 link Jobs 1 and 2, but the only thing I could find was related to job
 scheduling and nothing on how to synchronize the input/output of the
 linked jobs.



 The only crude solution that I can think of is to use a temporary file
 under HDFS, but even so I'm not sure if this will work.

 The overview of the process would be:
 Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer
 (key2, value2)] => Temporary file => Job B[Mapper (key2, value2) =>
 Reducer (key2, value3)] => output.

 Is there a better way to pass the output from Job A as input to Job B
 (e.g. using network streams or some built-in Java classes that don't do
 disk I/O)?



 The temporary file solution will work in a single-node configuration, but
 I'm not sure about an MPP config.

 Let's say Job A runs on nodes 0 and 1 and Job B runs on nodes 2 and 3,
 or both jobs run on all 4 nodes - will HDFS be able to automagically
 redistribute the records between nodes or does this need to be coded
 somehow?





Re: chaining (the output of) jobs/ reducers

2013-09-12 Thread Adrian CAPDEFIER
Thanks Bryan.

Yes, I am using hadoop + hdfs.

If I understand your point, hadoop tries to start the mapping processes on
nodes where the data is local and if that's not possible, then it is hdfs
that replicates the data to the mapper nodes?

I expected to have to set this up in the code and I completely ignored
HDFS; I guess it's a case of not seeing the forest for the trees!


On Thu, Sep 12, 2013 at 6:38 PM, Bryan Beaudreault bbeaudrea...@hubspot.com
 wrote:

 It really comes down to the following:

 In Job A set mapred.output.dir to some directory X.
 In Job B set mapred.input.dir to the same directory X.

 For Job A, do context.write() as normal, and each reducer will create an
 output file in mapred.output.dir.  Then in Job B each of those files will
 correspond to a mapper.

 You also need to make sure your input and output formats, as well as
 input and output keys/values, match up between the two jobs.

 If you are using HDFS, which it seems you are, the directories specified
 can be HDFS directories.  In that case, with a replication factor of 3,
 each of these output files will exist on 3 nodes.  Hadoop and HDFS will do
 the work to ensure that the mappers in the second job are data- or
 rack-local wherever possible.


 On Thu, Sep 12, 2013 at 12:35 PM, Adrian CAPDEFIER chivas314...@gmail.com
  wrote:

 Thank you, Chris. I will look at Cascading and Pig, but for starters I'd
 prefer to keep everything as close to the hadoop libraries as possible.

 I am sure I am overlooking something basic, as repartitioning is a fairly
 common operation in MPP environments.


 On Thu, Sep 12, 2013 at 2:39 PM, Chris Curtin curtin.ch...@gmail.com wrote:

 If you want to stay in Java look at Cascading. Pig is also helpful. I
 think there are others (Spring Integration, maybe?) but I'm not familiar
 with them enough to make a recommendation.

 Note that with Cascading and Pig you don't write 'map reduce'; you write
 logic and they map it to the various mapper/reducer steps automatically.

 Hope this helps,

 Chris


 On Thu, Sep 12, 2013 at 9:36 AM, Adrian CAPDEFIER 
 chivas314...@gmail.com wrote:

 Howdy,

 My application requires 2 distinct processing steps (reducers) to be
 performed on the input data. The first operation changes the key values,
 and records that had different keys in step 1 can end up having the same
 key in step 2.

 The heavy lifting of the operation is in step 1; step 2 only combines
 records whose keys were changed.

 In short, the overview is:
 Sequential file -> Step 1 -> Step 2 -> Output.


 To implement this in hadoop, it seems that I need to create a separate
 job for each step.

 Now I assumed there would be some sort of job management under hadoop to
 link Jobs 1 and 2, but the only thing I could find was related to job
 scheduling and nothing on how to synchronize the input/output of the
 linked jobs.



 The only crude solution that I can think of is to use a temporary file
 under HDFS, but even so I'm not sure if this will work.

 The overview of the process would be:
 Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer
 (key2, value2)] => Temporary file => Job B[Mapper (key2, value2) =>
 Reducer (key2, value3)] => output.

 Is there a better way to pass the output from Job A as input to Job B
 (e.g. using network streams or some built-in Java classes that don't do
 disk I/O)?



 The temporary file solution will work in a single-node configuration,
 but I'm not sure about an MPP config.

 Let's say Job A runs on nodes 0 and 1 and Job B runs on nodes 2 and 3,
 or both jobs run on all 4 nodes - will HDFS be able to automagically
 redistribute the records between nodes or does this need to be coded
 somehow?







Re: Job config before read fields

2013-09-09 Thread Adrian CAPDEFIER
Hi Shahab,

Sorry about the late reply; a personal matter came up and it took most of
my time. Thank you for your replies.

The solution I chose was to temporarily transfer the metadata along with
the data and then restore it on the reduce nodes. This works from a
functional perspective as long as there are no performance requirements and
it will have to do for now.

The permanent solution will likely involve tweaking hadoop, but that is a
different kettle of fish.


On Sun, Sep 1, 2013 at 12:48 AM, Shahab Yunus shahab.yu...@gmail.com wrote:

 Personally, I don't know a way to access job configuration parameters in
 custom implementations of Writables (at least not an elegant and
 appropriate one; of course hacks of various kinds can be done). Maybe
 experts can chime in?

 One idea that I thought about was to use MapWritable (if you have not
 explored it already). You can encode the 'custom metadata' for your 'data'
 as one-byte symbols and move your data in the M/R flow as a map. Then
 during deserialization you will have the type (or your 'custom metadata')
 in the key part of the map and the value would be your actual data. This
 aligns with the efficient approach that is used natively in Hadoop for
 Strings/Text, i.e. compact metadata (though I agree that you are not
 taking advantage of the other aspect of non-dependence between metadata
 and the data it defines).

 Take a look at that:
 Page 96 of the Definitive Guide:

 http://books.google.com/books?id=Nff49D7vnJcC&pg=PA96&lpg=PA96&dq=mapwritable+in+hadoop&source=bl&ots=IiixYu7vXu&sig=4V6H7cY-MrNT7Rzs3WlODsDOoP4&hl=en&sa=X&ei=aX4iUp2YGoaosASs_YCACQ&sqi=2&ved=0CFUQ6AEwBA#v=onepage&q=mapwritable%20in%20hadoop&f=false

 and then this:

 http://www.chrisstucchio.com/blog/2011/mapwritable_sometimes_a_performance_hog.html

 and add your own custom types here (note that you are restricted by size
 of byte):

 http://hadoop.sourcearchive.com/documentation/0.20.2plus-pdfsg1-1/AbstractMapWritable_8java-source.html

 Regards,
 Shahab
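
 As a rough sketch of the MapWritable idea (the one-byte symbols and the Text
 payload below are made up for illustration): the mapper emits a MapWritable
 whose single entry maps a metadata tag to the actual value, and the reduce
 side reads the tag back off the key side of the map.

 import java.util.Map;

 import org.apache.hadoop.io.ByteWritable;
 import org.apache.hadoop.io.MapWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.io.Writable;

 public class TaggedRecord {
     // Hypothetical one-byte symbols standing in for the custom metadata.
     public static final byte TYPE_A = 0x01;
     public static final byte TYPE_B = 0x02;

     // Wrap a value with its metadata symbol; emit this from the mapper.
     public static MapWritable tag(byte symbol, Writable value) {
         MapWritable record = new MapWritable();
         record.put(new ByteWritable(symbol), value);
         return record;
     }

     // On the reduce side, recover the symbol and the value again
     // (the payload is assumed to be Text here).
     public static void unpack(MapWritable record) {
         for (Map.Entry<Writable, Writable> e : record.entrySet()) {
             byte symbol = ((ByteWritable) e.getKey()).get();
             Text payload = (Text) e.getValue();
             System.out.println(symbol + " -> " + payload);
         }
     }
 }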


 On Sat, Aug 31, 2013 at 5:38 AM, Adrian CAPDEFIER 
 chivas314...@gmail.com wrote:

 Thank you for your help Shahab.

 I guess I wasn't being too clear. My logic is that I use a custom type as
 key and in order to deserialize it on the compute nodes, I need an extra
 piece of information (also a custom type).

 To use an analogy, a Text is serialized by writing the length of the
 string as a number and then the bytes that compose the actual string. When
 it is deserialized, the number informs the reader when to stop reading the
 string. This number varies from string to string and it is compact, so it
 makes sense to serialize it with the string.

 My use case is similar to it. I have a complex type (let's call this
 data), and in order to deserialize it, I need another complex type (let's
 call this second type metadata). The metadata is not closely tied to the
 data (i.e. if the data value changes, the metadata does not) and the
 metadata size is quite large.

 I ruled out a couple of options, but please let me know if you think I
 did so for the wrong reasons:
 1. I could serialize each data value with its own metadata value, but
 since the data value count is in the tens of millions or more and the
 distinct metadata value count can be up to one hundred, it would waste
 resources in the system.
 2. I could serialize the metadata and then the data as a collection
 property of the metadata. This would be an elegant solution code-wise, but
 then all the data would have to be read and kept in memory as a massive
 object before any reduce operations can happen. I wasn't able to find any
 info on this online so this is just a guess from peeking at the hadoop code.

 My solution was to serialize the data with a hash of the metadata and
 separately serialize the metadata and its hash in the job configuration (as
 key/value pairs). For this to work, I would need to be able to deserialize
 the metadata on the reduce node before the data is deserialized in the
 readFields() method.

 I think that for that to happen I need to hook into the code somewhere
 where a context or job configuration is used (before readFields()), but I'm
 stumped as to where that is.

  Cheers,
 Adi


 On Sat, Aug 31, 2013 at 3:42 AM, Shahab Yunus shahab.yu...@gmail.com wrote:

 What I meant was that you might have to split or redesign your logic or
 your use case (which we don't know about)?

 Regards,
 Shahab


 On Fri, Aug 30, 2013 at 10:31 PM, Adrian CAPDEFIER 
 chivas314...@gmail.com wrote:

 But how would the comparator have access to the job config?


 On Sat, Aug 31, 2013 at 2:38 AM, Shahab Yunus 
 shahab.yu...@gmail.com wrote:

 I think you have to override/extend the Comparator to achieve that,
 something like what is done in Secondary Sort?

 Regards,
 Shahab


 On Fri, Aug 30, 2013 at 9:01 PM, Adrian CAPDEFIER 
 chivas314...@gmail.com wrote:

 Howdy,

 I apologise for the lack of code in this message, but the code is
 fairly convoluted and it would obscure my problem

Re: Job config before read fields

2013-08-31 Thread Adrian CAPDEFIER
Thank you for your help Shahab.

I guess I wasn't being too clear. My logic is that I use a custom type as
key and in order to deserialize it on the compute nodes, I need an extra
piece of information (also a custom type).

To use an analogy, a Text is serialized by writing the length of the string
as a number and then the bytes that compose the actual string. When it is
deserialized, the number informs the reader when to stop reading the
string. This number varies from string to string and it is compact, so it
makes sense to serialize it with the string.

My use case is similar to it. I have a complex type (let's call this data),
and in order to deserialize it, I need another complex type (let's call
this second type metadata). The metadata is not closely tied to the data
(i.e. if the data value changes, the metadata does not) and the metadata
size is quite large.

I ruled out a couple of options, but please let me know if you think I did
so for the wrong reasons:
1. I could serialize each data value with its own metadata value, but
since the data value count is in the tens of millions or more and the
distinct metadata value count can be up to one hundred, it would waste
resources in the system.
2. I could serialize the metadata and then the data as a collection
property of the metadata. This would be an elegant solution code-wise, but
then all the data would have to be read and kept in memory as a massive
object before any reduce operations can happen. I wasn't able to find any
info on this online so this is just a guess from peeking at the hadoop code.

My solution was to serialize the data with a hash of the metadata and
separately serialize the metadata and its hash in the job configuration (as
key/value pairs). For this to work, I would need to be able to deserialize
the metadata on the reduce node before the data is deserialized in the
readFields() method.

I think that for that to happen I need to hook into the code somewhere
where a context or job configuration is used (before readFields()), but I'm
stumped as to where that is.

Cheers,
Adi
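
As a minimal sketch of the approach in the second-to-last paragraph above
(the property prefix and Text payload are assumptions): the record Writable
carries only a short hash of its metadata, the full metadata is published
once in the job configuration under that hash, and the open question is how
to resolve that hash back to the metadata from inside readFields(), which
receives no Configuration.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class HashedRecord implements Writable {
    private String metadataHash;         // e.g. an MD5 of the serialized metadata
    private Text payload = new Text();   // the actual data value

    @Override
    public void write(DataOutput out) throws IOException {
        Text.writeString(out, metadataHash); // compact reference, not the metadata itself
        payload.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        metadataHash = Text.readString(in);
        payload.readFields(in);
        // Turning metadataHash back into the full metadata would need the
        // Configuration here, which readFields() does not receive.
    }

    // Driver side: publish each distinct metadata value once, keyed by its hash.
    public static void publishMetadata(Configuration conf, String hash, String serializedMetadata) {
        conf.set("metadata." + hash, serializedMetadata);
    }

    // E.g. in Reducer.setup(): look the metadata up again by its hash.
    public static String lookupMetadata(Configuration conf, String hash) {
        return conf.get("metadata." + hash);
    }
}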


On Sat, Aug 31, 2013 at 3:42 AM, Shahab Yunus shahab.yu...@gmail.com wrote:

 What I meant was that you might have to split or redesign your logic or
 your use case (which we don't know about)?

 Regards,
 Shahab


 On Fri, Aug 30, 2013 at 10:31 PM, Adrian CAPDEFIER chivas314...@gmail.com
  wrote:

 But how would the comparator have access to the job config?


 On Sat, Aug 31, 2013 at 2:38 AM, Shahab Yunus shahab.yu...@gmail.com wrote:

 I think you have to override/extend the Comparator to achieve that,
 something like what is done in Secondary Sort?

 Regards,
 Shahab


 On Fri, Aug 30, 2013 at 9:01 PM, Adrian CAPDEFIER 
 chivas314...@gmail.com wrote:

 Howdy,

 I apologise for the lack of code in this message, but the code is
 fairly convoluted and it would obscure my problem. That being said, I can
 put together some sample code if really needed.

 I am trying to pass some metadata between the map & reduce steps. This
 metadata is read and generated in the map step and stored in the job
 config. It also needs to be recreated on the reduce node before the
 key/value fields can be read in the readFields function.

 I had assumed that I would be able to override the Reducer.setup()
 function and that would be it, but apparently the readFields function is
 called before the Reducer.setup() function.

 My question is: what is the best place on the reduce node where I
 can access the job configuration/context before the readFields function
 is called?

 This is the stack trace:

     at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:103)
     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:)
     at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:70)
     at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1298)
     at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:415)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
     at org.apache.hadoop.mapred.Child.main(Child.java:249)







Job config before read fields

2013-08-30 Thread Adrian CAPDEFIER
Howdy,

I apologise for the lack of code in this message, but the code is fairly
convoluted and it would obscure my problem. That being said, I can put
together some sample code if really needed.

I am trying to pass some metadata between the map & reduce steps. This
metadata is read and generated in the map step and stored in the job
config. It also needs to be recreated on the reduce node before the
key/value fields can be read in the readFields function.

I had assumed that I would be able to override the Reducer.setup() function
and that would be it, but apparently the readFields function is called
before the Reducer.setup() function.

My question is: what is the best place on the reduce node where I can
access the job configuration/context before the readFields function is
called?

This is the stack trace:

    at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:103)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:)
    at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:70)
    at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1298)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)


Re: Job config before read fields

2013-08-30 Thread Adrian CAPDEFIER
But how would the comparator have access to the job config?


On Sat, Aug 31, 2013 at 2:38 AM, Shahab Yunus shahab.yu...@gmail.com wrote:

 I think you have to override/extend the Comparator to achieve that,
 something like what is done in Secondary Sort?

 Regards,
 Shahab
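
 As a hedged sketch of the comparator route (this describes an assumed
 mechanism, not something confirmed in the thread): a comparator class
 registered with job.setSortComparatorClass(...) is instantiated through
 ReflectionUtils, so if it also implements Configurable, its setConf()
 should be handed the job configuration before any compare() call. The key
 class and property name below are placeholders.

 import org.apache.hadoop.conf.Configurable;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.io.WritableComparator;

 public class ConfiguredComparator extends WritableComparator implements Configurable {
     private Configuration conf;
     private String metadata; // whatever compare() needs from the job config

     public ConfiguredComparator() {
         super(Text.class, true); // Text stands in for the real custom key type
     }

     @Override
     public void setConf(Configuration conf) {
         this.conf = conf;
         this.metadata = conf.get("my.metadata"); // placeholder property name
     }

     @Override
     public Configuration getConf() {
         return conf;
     }

     // compare(WritableComparable, WritableComparable) could then use
     // `metadata` when deserializing/ordering the custom keys.
 }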


 On Fri, Aug 30, 2013 at 9:01 PM, Adrian CAPDEFIER 
 chivas314...@gmail.com wrote:

 Howdy,

 I apologise for the lack of code in this message, but the code is fairly
 convoluted and it would obscure my problem. That being said, I can put
 together some sample code if really needed.

 I am trying to pass some metadata between the map & reduce steps. This
 metadata is read and generated in the map step and stored in the job
 config. It also needs to be recreated on the reduce node before the
 key/value fields can be read in the readFields function.

 I had assumed that I would be able to override the Reducer.setup()
 function and that would be it, but apparently the readFields function is
 called before the Reducer.setup() function.

 My question is: what is the best place on the reduce node where I
 can access the job configuration/context before the readFields function
 is called?

 This is the stack trace:

     at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:103)
     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:)
     at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:70)
     at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1298)
     at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:415)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
     at org.apache.hadoop.mapred.Child.main(Child.java:249)