fairscheduler : group.name | Please edit patch to work for 0.20.205
Can someone have a look at the patch MAPREDUCE-2457 and see if it can be modified to work for 0.20.205? I am very new to Java and have no idea what's going on in that patch. If you have any pointers for me, I will see if I can do it on my own.

Thanks, Austin

On Fri, Mar 2, 2012 at 7:15 PM, Austin Chungath austi...@gmail.com wrote:

I tried the patch MAPREDUCE-2457 but it didn't work for my Hadoop 0.20.205. Are you sure this patch will work for 0.20.205? According to the description, the patch works for 0.21 and 0.22, and it says that 0.20 supports group.name without this patch... So does this patch also apply to 0.20.205?

Thanks, Austin

On Thu, Mar 1, 2012 at 11:24 PM, Harsh J ha...@cloudera.com wrote:

The group.name scheduler support was introduced in https://issues.apache.org/jira/browse/HADOOP-3892 but may have been broken by the security changes present in 0.20.205. You'll need the fix presented in https://issues.apache.org/jira/browse/MAPREDUCE-2457 to have group.name support.

On Thu, Mar 1, 2012 at 6:42 PM, Austin Chungath austi...@gmail.com wrote:

I am running the fair scheduler on Hadoop 0.20.205.0: http://hadoop.apache.org/common/docs/r0.20.205.0/fair_scheduler.html

The above page talks about the property *mapred.fairscheduler.poolnameproperty*, which I can set to *group.name*. The default is user.name: when a user submits a job, the fair scheduler assigns the job to a pool named after that user. I am trying to change it to group.name so that the job is submitted to a pool named after the user's Linux group. That way, all jobs from any user in a given group go to the same pool instead of an individual pool for every user. But *group.name* doesn't seem to work; has anyone tried this before? *user.name* and *mapred.job.queue.name* work. Is group.name supported in 0.20.205.0? I don't see it mentioned in the docs.

Thanks, Austin

-- Harsh J
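For reference, the pool-name property discussed in this thread would normally be set in mapred-site.xml, along the lines of the sketch below. The property names come from the thread itself; the scheduler class value and file placement are assumptions about a typical 0.20.x setup, and, as Harsh notes, group.name only behaves as expected on releases carrying the HADOOP-3892 / MAPREDUCE-2457 behavior.

```xml
<!-- mapred-site.xml (sketch): enable the fair scheduler and route each
     job to a pool named after the submitting user's primary group
     instead of the default user.name. -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>group.name</value>
</property>
```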
Re: Setting up Hadoop single node setup on Mac OS X
On 02/27/2012 11:53 AM, W.P. McNeill wrote:

You don't need any virtualization. Mac OS X is Linux and runs Hadoop as is.

Nitpick: OS X is not Linux; it descends from NeXTSTEP and is built on the Mach kernel, a POSIX-compliant system that is distinct from Linux.
Re: AWS MapReduce
AWS MapReduce (EMR) does not use S3 for its HDFS persistence. If it did, your S3 billing would be massive :) EMR reads all input jar files and input data from S3, but it copies these files down to its local disks. It then starts the MR process, doing all HDFS reads and writes against the local disks. At the end of the MR job, it copies the job output and all process logs to S3, and then tears down the VM instances.

You can see this for yourself if you spin up a small EMR cluster but turn off the configuration flag that kills the VMs at the end of the MR job. Then look at the Hadoop configuration files to see how Hadoop is configured.

I really like EMR. Amazon has done a lot of work to optimize the Hadoop configurations and VM instance AMIs to execute MR jobs fairly efficiently on a VM cluster. I had to do a lot of (expensive) trial-and-error work to figure out an optimal Hadoop/VM configuration that runs our MR jobs without crashing or timing out. The only reason we didn't standardize on EMR was that it strongly bound our code base and process to EMR for Hadoop processing, versus a flexible infrastructure that could use a local cluster or a cluster on a different cloud provider.

On Sun, Mar 4, 2012 at 8:51 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

As far as I see in the docs, it looks like you could also use HDFS instead of S3. But what I am not sure of is whether these are local disks or EBS.

On Sun, Mar 4, 2012 at 2:27 AM, Hannes Carl Meyer hannesc...@googlemail.com wrote:

Hi, yes, it's loaded from S3. IMHO Amazon AWS Map-Reduce is pretty slow. The setup is done pretty fast and there are some configuration parameters you can tune, for example block sizes etc., but in the end, IMHO, setting up EC2 instances by copying images is the better alternative.

Kind Regards, Hannes

On Sun, Mar 4, 2012 at 2:31 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

I think I found the answer to this question. However, it's still not clear whether HDFS is on local disks or EBS volumes. Does anyone know?

On Sat, Mar 3, 2012 at 3:54 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

Just want to check how many are using AWS MapReduce and understand the pros and cons of Amazon's MapReduce machines. Is it true that these map reduce machines are really reading and writing from S3 instead of local disks? Has anyone found issues with Amazon MapReduce, and how does it compare with running MapReduce on locally attached disks versus S3?

---
www.informera.de Hadoop Big Data Services

-- Thanks, John C
Re: AWS MapReduce
On Mon, Mar 5, 2012 at 7:40 AM, John Conwell j...@iamjohn.me wrote:

AWS MapReduce (EMR) does not use S3 for its HDFS persistence. If it did, your S3 billing would be massive :) EMR reads all input jar files and input data from S3, but it copies these files down to its local disks. It then starts the MR process, doing all HDFS reads and writes against the local disks. At the end of the MR job, it copies the job output and all process logs to S3, and then tears down the VM instances. [...]

Thanks for your input. I am assuming HDFS is created on ephemeral disks and not EBS. Also, is it possible to share some of your findings?

On Sun, Mar 4, 2012 at 8:51 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

As far as I see in the docs, it looks like you could also use HDFS instead of S3. But what I am not sure of is whether these are local disks or EBS. [...]
Re: Custom Seq File Loader: ClassNotFoundException
Hi Madhu, it has the following line:

    TermDocFreqArrayWritable () {}

but I'll try it with public access in case it's being called from outside my package.

Thank you, Mark

On Sun, Mar 4, 2012 at 9:55 PM, madhu phatak phatak@gmail.com wrote:

Hi, please make sure that your CustomWritable has a default constructor.

On Sat, Mar 3, 2012 at 4:56 AM, Mark question markq2...@gmail.com wrote:

Hello, I'm trying to debug my code through Eclipse, which worked fine with the bundled Hadoop applications (e.g. wordcount), but as soon as I run it on my application with my custom sequence input file/types, I get:

    java.lang.RuntimeException: java.io.IOException: WritableName can't load class
        at SequenceFile$Reader.getValueClass(SequenceFile.java)

because my value class is custom. In other words, how can I add/build my CustomWritable class so that it sits alongside Hadoop's LongWritable, IntWritable, etc.? Has anyone used Eclipse for this?

Mark

-- Join me at http://hadoopworkshop.eventbrite.com/
Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R
Streaming is good for simulation: long-running map-only processes, where Pig doesn't really help and it is simple to fire off a streaming process. You do have to set some options so they can take a long time to return and report counters.

Russell Jurney http://datasyndrome.com

On Mar 5, 2012, at 12:38 PM, Eli Finkelshteyn iefin...@gmail.com wrote:

I'm really interested in this as well. I have trouble seeing a really good use case for streaming map-reduce. Is there something I can do in streaming that I can't do in Pig? If I want to re-use previously written Python functions from my code base, I can do that in Pig as much as in streaming, and from what I've experienced thus far, Python streaming seems to go slower than or at the same speed as Pig. So why would I want to write a whole lot of harder-to-read mappers and reducers when I can write equally fast, shorter, and clearer code in Pig? Maybe it's obvious, but currently I just can't think of the right use case.

Eli

On 3/2/12 9:21 AM, Subir S wrote:

On Fri, Mar 2, 2012 at 12:38 PM, Harsh J ha...@cloudera.com wrote:

On Fri, Mar 2, 2012 at 10:18 AM, Subir S subir.sasiku...@gmail.com wrote:

Hello folks, are there any pointers to comparisons between Apache Pig and Hadoop Streaming map reduce jobs?

I do not see why you seek to compare these two. Pig offers a language that lets you write data-flow operations and runs these statements as a series of MR jobs for you automatically (making it a great tool for getting data processing done really quickly, without bothering with code), while streaming is something you use to write non-Java, simple MR jobs. Both have their own purposes.

Basically we are comparing these two to see the benefits and how much they help in improving productive coding time, without jeopardizing the performance of MR jobs.

Also, there was a claim in our company that Pig performs better than plain Map Reduce jobs. Is this true? Are there any such benchmarks available?

Pig _runs_ MR jobs. It does do job-design (and some data) optimizations based on your queries, which is what may give it an edge over designing elaborate flows of plain MR jobs with tools like Oozie/JobControl (which takes more time to do). But regardless, Pig only makes doing the same thing easy, via Pig Latin statements.

I knew that Pig runs MR jobs, as Hive runs MR jobs. But Hive jobs become pretty slow with a lot of joins, which we can do faster by writing raw MR jobs. With that context, I was trying to see how Pig runs MR jobs: for example, what kind of projects should consider Pig, say when we have a lot of joins that take time to write as plain MR jobs. Thoughts?

Thank you Harsh for your comments. They are helpful!

-- Harsh J
Re: Custom Seq File Loader: ClassNotFoundException
Unfortunately, public didn't change my error... Any other ideas? Has anyone run Hadoop in Eclipse with custom sequence inputs?

Thank you, Mark

On Mon, Mar 5, 2012 at 9:58 AM, Mark question markq2...@gmail.com wrote:

Hi Madhu, it has the following line:

    TermDocFreqArrayWritable () {}

but I'll try it with public access in case it's being called from outside my package. [...]

On Sun, Mar 4, 2012 at 9:55 PM, madhu phatak phatak@gmail.com wrote:

Hi, please make sure that your CustomWritable has a default constructor. [...]

-- Join me at http://hadoopworkshop.eventbrite.com/
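The default-constructor requirement Madhu mentions can be seen with plain JDK reflection, which is essentially how Hadoop instantiates a Writable when deserializing a sequence file value (via ReflectionUtils.newInstance). The sketch below is JDK-only; the class names are invented for illustration and do not come from the thread. It shows why a package-private constructor like the one in TermDocFreqArrayWritable is invisible to code outside the package, while a public one is not.

```java
import java.lang.reflect.Constructor;

public class WritableCtorDemo {
    // Stand-in for a custom Writable whose no-arg constructor is
    // package-private, like `TermDocFreqArrayWritable () {}`.
    static class PackagePrivateCtor {
        PackagePrivateCtor() {}
    }

    // Stand-in for the fixed version with a public no-arg constructor.
    public static class PublicCtor {
        public PublicCtor() {}
    }

    public static void main(String[] args) throws Exception {
        // getConstructor() only returns *public* constructors, which is
        // what a framework living in another package effectively needs.
        Constructor<PublicCtor> ok = PublicCtor.class.getConstructor();
        System.out.println("public ctor found: " + (ok != null));

        try {
            // The package-private constructor is not visible here,
            // so this throws NoSuchMethodException.
            PackagePrivateCtor.class.getConstructor();
            System.out.println("package-private ctor found");
        } catch (NoSuchMethodException e) {
            System.out.println("package-private ctor NOT visible");
        }
    }
}
```

Note that a separate issue can coexist with this one: if the class itself is not on the task's classpath (or not registered under the name stored in the sequence file header), WritableName fails with the "can't load class" error regardless of constructor visibility.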
Re: OutOfMemoryError: unable to create new native thread
Hi Rohini,

I ran into a similar problem just yesterday. In my case the max process count (ulimit -u) was set to 1024, which was too small; when I increased it, the problem went away. But you said ulimit on the machine is set to unlimited, so I'm not sure whether this will help :) Also check `cat /proc/sys/kernel/threads-max`; this is a system-wide setting for the total number of threads.

On Tue, Mar 6, 2012 at 4:30 AM, Rohini U rohin...@gmail.com wrote:

Hi all, I am running a map reduce job that uses around 120 MB of data, and I get this out-of-memory error. Ulimit on the machine is set to unlimited. Any ideas on how to fix this? The stack trace is given below:

    Exception in thread "main" org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:597)
        at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.kill(JvmManager.java:553)
        at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.killJvmRunner(JvmManager.java:317)
        at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.killJvm(JvmManager.java:297)
        at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.taskKilled(JvmManager.java:289)
        at org.apache.hadoop.mapred.JvmManager.taskKilled(JvmManager.java:158)
        at org.apache.hadoop.mapred.TaskRunner.kill(TaskRunner.java:782)
        at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.kill(TaskTracker.java:2938)
        at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.jobHasFinished(TaskTracker.java:2910)
        at org.apache.hadoop.mapred.TaskTracker.purgeTask(TaskTracker.java:1974)
        at org.apache.hadoop.mapred.TaskTracker.fatalError(TaskTracker.java:3327)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
        at org.apache.hadoop.ipc.Client.call(Client.java:1107)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
        at $Proxy0.fatalError(Unknown Source)
        at org.apache.hadoop.mapred.Child.main(Child.java:325)

Thanks, Rohini

-- Kindest Regards, Clay Chiang
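As a quick way to inspect the system-wide limit Clay mentions without shelling out, the same value can be read from procfs. This is a small JDK-only sketch of my own (not from the thread); it is Linux-specific and falls back gracefully elsewhere.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ThreadsMaxCheck {
    // Returns the kernel's system-wide thread limit from
    // /proc/sys/kernel/threads-max, or -1 if the procfs entry is
    // unavailable (e.g. on non-Linux systems).
    public static long threadsMax() {
        Path p = Paths.get("/proc/sys/kernel/threads-max");
        try {
            return Long.parseLong(Files.readAllLines(p).get(0).trim());
        } catch (Exception e) {
            return -1L;
        }
    }

    public static void main(String[] args) {
        long max = threadsMax();
        if (max > 0) {
            System.out.println("kernel threads-max = " + max);
        } else {
            System.out.println("threads-max unavailable on this system");
        }
    }
}
```

Remember that "unable to create new native thread" is about native resources (process/thread count limits, per-thread stack memory), not the Java heap, so raising -Xmx typically does not help and can even make it worse.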
Re: Java Heap space error
All I see in the logs is:

    2012-03-05 17:26:36,889 FATAL org.apache.hadoop.mapred.TaskTracker: Task: attempt_201203051722_0001_m_30_1 - Killed : Java heap space

It looks like the task tracker is killing the tasks; I'm not sure why. I increased the heap from 512 MB to 1 GB and it still fails.

On Mon, Mar 5, 2012 at 5:03 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

I currently have mapred.child.java.opts set to 512 MB and I am getting heap space errors. How should I go about debugging heap space issues?
Re: Java Heap space error
Sorry for the multiple emails. I did find:

    2012-03-05 17:26:35,636 INFO org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call- Usage threshold init = 715849728(699072K) used = 575921696(562423K) committed = 715849728(699072K) max = 715849728(699072K)
    2012-03-05 17:26:35,719 INFO org.apache.pig.impl.util.SpillableMemoryManager: Spilled an estimate of 7816154 bytes from 1 objects. init = 715849728(699072K) used = 575921696(562423K) committed = 715849728(699072K) max = 715849728(699072K)
    2012-03-05 17:26:36,881 INFO org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call - Collection threshold init = 715849728(699072K) used = 358720384(350312K) committed = 715849728(699072K) max = 715849728(699072K)
    2012-03-05 17:26:36,885 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
    2012-03-05 17:26:36,888 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
        at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:39)
        at java.nio.CharBuffer.allocate(CharBuffer.java:312)
        at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:760)
        at org.apache.hadoop.io.Text.decode(Text.java:350)
        at org.apache.hadoop.io.Text.decode(Text.java:327)
        at org.apache.hadoop.io.Text.toString(Text.java:254)
        at org.apache.pig.piggybank.storage.SequenceFileLoader.translateWritableToPigDataType(SequenceFileLoader.java:105)
        at org.apache.pig.piggybank.storage.SequenceFileLoader.getNext(SequenceFileLoader.java:139)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:456)
        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
        at org.apache.hadoop.mapred.Child.main(Child.java:264)

On Mon, Mar 5, 2012 at 5:46 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

All I see in the logs is:

    2012-03-05 17:26:36,889 FATAL org.apache.hadoop.mapred.TaskTracker: Task: attempt_201203051722_0001_m_30_1 - Killed : Java heap space

It looks like the task tracker is killing the tasks; I'm not sure why. I increased the heap from 512 MB to 1 GB and it still fails.

On Mon, Mar 5, 2012 at 5:03 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

I currently have mapred.child.java.opts set to 512 MB and I am getting heap space errors. How should I go about debugging heap space issues?
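For anyone hitting the same trace: in 0.20/1.x-era Hadoop, the heap of each map/reduce child JVM is governed by mapred.child.java.opts. A sketch of the 1 GB setting described in this thread, with the caveat that the file placement in mapred-site.xml (rather than per-job configuration) is an assumption about the poster's setup:

```xml
<!-- mapred-site.xml (sketch): give each map/reduce child JVM a 1 GB
     heap, matching the 512 MB -> 1 GB increase attempted above. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
```

Note the trace points at SequenceFileLoader.translateWritableToPigDataType decoding a Text value, so a single very large record can exhaust the heap no matter how high it is set; checking record sizes in the input sequence file is a useful second step.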
hadoop 1.0 / HOD or CloneZilla?
Hi all, I have experience with Hadoop 0.20.204 on a 3-machine pilot cluster, and now I'm trying to set up a real cluster on 32 Linux machines. I have some questions:

1. Is Hadoop 1.0 stable? On the Hadoop site this version is indicated as a beta release.

2. As you know, installing and setting up Hadoop on all 32 machines separately is not a good idea, so what can I do?
   1. Use Hadoop on Demand (HOD)?
   2. Or use an OS image replication tool such as Clonezilla? I think this method is better because, in addition to Hadoop, I can clone other settings such as SSH or Samba to all machines.

Let me know your ideas.

B.S, Masoud.
Re: why does my mapper class reads my input file twice?
It's your use of the mapred.input.dir property, which is a reserved name in the framework (it's what FileInputFormat uses). You have a config you extract a path from:

    Path input = new Path(conf.get("mapred.input.dir"));

Then you do:

    FileInputFormat.addInputPath(job, input);

which internally simply appends a path to a config prop called mapred.input.dir. Hence your job gets launched with two input files (the very same): one added by the default Tool-provided configuration (because of your -Dmapred.input.dir) and the other added by you.

Fix the input path line to use a different config key:

    Path input = new Path(conf.get("input.path"));

And run the job as:

    hadoop jar dummy-0.1.jar dummy.MyJob -Dinput.path=data/dummy.txt -Dmapred.output.dir=result

On Tue, Mar 6, 2012 at 9:03 AM, Jane Wayne jane.wayne2...@gmail.com wrote:

I have code that reads in a text file. I notice that each line in the text file is somehow being read twice. Why is this happening? My mapper class looks like the following:

    public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        private static final Log _log = LogFactory.getLog(MyMapper.class);

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String s = (new StringBuilder()).append(value.toString()).append("m").toString();
            context.write(key, new Text(s));
            _log.debug(key.toString() + "=" + s);
        }
    }

My reducer class looks like the following:

    public class MyReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
        private static final Log _log = LogFactory.getLog(MyReducer.class);

        @Override
        public void reduce(LongWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            for (Iterator<Text> it = values.iterator(); it.hasNext();) {
                Text txt = it.next();
                String s = (new StringBuilder()).append(txt.toString()).append("r").toString();
                context.write(key, new Text(s));
                _log.debug(key.toString() + "=" + s);
            }
        }
    }

My job class looks like the following:

    public class MyJob extends Configured implements Tool {
        public static void main(String[] args) throws Exception {
            ToolRunner.run(new Configuration(), new MyJob(), args);
        }

        @Override
        public int run(String[] args) throws Exception {
            Configuration conf = getConf();
            Path input = new Path(conf.get("mapred.input.dir"));
            Path output = new Path(conf.get("mapred.output.dir"));
            Job job = new Job(conf, "dummy job");
            job.setMapOutputKeyClass(LongWritable.class);
            job.setMapOutputValueClass(Text.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            job.setMapperClass(MyMapper.class);
            job.setReducerClass(MyReducer.class);
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);
            job.setJarByClass(MyJob.class);
            return job.waitForCompletion(true) ? 0 : 1;
        }
    }

The text file that I am trying to read in looks like the following. As you can see, there are 9 lines:

    T, T
    T, T
    T, T
    F, F
    F, F
    F, F
    F, F
    T, F
    F, T

The output file that I get after my job runs looks like the following. As you can see, there are 18 lines; each key is emitted twice from the mapper to the reducer:

    0   T, Tmr
    0   T, Tmr
    6   T, Tmr
    6   T, Tmr
    12  T, Tmr
    12  T, Tmr
    18  F, Fmr
    18  F, Fmr
    24  F, Fmr
    24  F, Fmr
    30  F, Fmr
    30  F, Fmr
    36  F, Fmr
    36  F, Fmr
    42  T, Fmr
    42  T, Fmr
    48  F, Tmr
    48  F, Tmr

The way I execute my job is as follows (cygwin + hadoop 0.20.2):

    hadoop jar dummy-0.1.jar dummy.MyJob -Dmapred.input.dir=data/dummy.txt -Dmapred.output.dir=result

Originally this happened when I read in a sequence file, but even for a text file the problem is still happening. Is it the way I have set up my Job?

-- Harsh J
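Harsh's diagnosis can be modeled without Hadoop on the classpath: Hadoop's Configuration stores input paths as one comma-separated string under mapred.input.dir, and FileInputFormat.addInputPath appends to whatever is already there. The JDK-only sketch below is a simplified stand-in for that mechanism (the HashMap plays the role of Configuration; the names are illustrative, not Hadoop's actual internals).

```java
import java.util.HashMap;
import java.util.Map;

public class DuplicateInputDemo {
    static final String KEY = "mapred.input.dir";

    // Mimics FileInputFormat.addInputPath: append the new path to the
    // existing comma-separated list rather than replacing it.
    static void addInputPath(Map<String, String> conf, String path) {
        String old = conf.get(KEY);
        conf.put(KEY, old == null ? path : old + "," + path);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // -Dmapred.input.dir=data/dummy.txt on the command line already
        // populates the reserved property via the Tool/GenericOptionsParser...
        conf.put(KEY, "data/dummy.txt");
        // ...and the job's own addInputPath(job, input) then appends the
        // very same path a second time.
        addInputPath(conf, "data/dummy.txt");
        // The job therefore sees two input files, so every record is
        // read (and emitted) twice.
        System.out.println(conf.get(KEY));
        System.out.println("inputs: " + conf.get(KEY).split(",").length);
    }
}
```

Running this prints `data/dummy.txt,data/dummy.txt` and `inputs: 2`, which is exactly the doubled-input situation behind the duplicated output lines in the thread.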