Re: Linear slowdown producing streaming output

2011-02-05 Thread Ken Goodhope
Mappers don't just write data, they also sort it. Different parameters
like the size of the sort buffer can have an impact on these kinds of
metrics.
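As an illustration of the kind of parameter in question (a sketch, not part of the original message, assuming the pre-0.21 names io.sort.mb and io.sort.factor; for a streaming job these would normally be passed as -D options, and the values below are examples only):

    import org.apache.hadoop.conf.Configuration;

    public class SortBufferSettings {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Map-side sort buffer, in MB (default 100): a larger buffer means
        // fewer spills to disk while the mapper's output is being sorted.
        conf.setInt("io.sort.mb", 200);
        // Number of spill streams merged at once during the sort (default 10).
        conf.setInt("io.sort.factor", 50);
        // This conf would then be handed to the Job/JobConf that runs the task.
      }
    }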

On Friday, February 4, 2011, Keith Wiley kwi...@keithwiley.com wrote:
 I noticed that it takes much longer to write 55 MBs to a streaming output 
 than it takes to write 12 MBs (much more than 4-5X longer), so I broke the 
 output up, writing 1 MB at a time and discovered a perfectly linear slowdown. 
  Bottom line, the more data I have already written to stdout from a streaming 
 task, the longer it takes to write the next block of data.  I have no idea if 
 this is intrinsic to producing stdout from any Unix process (I've never heard 
 of such a thing) or if this is a Hadoop issue.  Does anyone have any idea 
 what's going on here?

     From pos 0, wrote 1048576 bytes. Next pos will be 1048576.
     diffTimeWriteOneBlock:         0.31s
     From pos 1048576, wrote 1048576 bytes. Next pos will be 2097152.
     diffTimeWriteOneBlock:          0.9s
     From pos 2097152, wrote 1048576 bytes. Next pos will be 3145728.
     diffTimeWriteOneBlock:         1.46s
     From pos 3145728, wrote 1048576 bytes. Next pos will be 4194304.
     diffTimeWriteOneBlock:         1.98s
     From pos 4194304, wrote 1048576 bytes. Next pos will be 5242880.
     diffTimeWriteOneBlock:         2.47s
     From pos 5242880, wrote 1048576 bytes. Next pos will be 6291456.
     diffTimeWriteOneBlock:         3.06s
     From pos 6291456, wrote 1048576 bytes. Next pos will be 7340032.
     diffTimeWriteOneBlock:         3.53s
     From pos 7340032, wrote 1048576 bytes. Next pos will be 8388608.
     diffTimeWriteOneBlock:         3.96s
     From pos 8388608, wrote 1048576 bytes. Next pos will be 9437184.
     diffTimeWriteOneBlock:         4.24s
     From pos 9437184, wrote 1048576 bytes. Next pos will be 10485760.
     diffTimeWriteOneBlock:         4.74s
     From pos 10485760, wrote 1048576 bytes. Next pos will be 11534336.
     diffTimeWriteOneBlock:         5.24s
     From pos 11534336, wrote 1048576 bytes. Next pos will be 12582912.
     diffTimeWriteOneBlock:         5.72s
     From pos 12582912, wrote 1048576 bytes. Next pos will be 13631488.
     diffTimeWriteOneBlock:         6.25s
     From pos 13631488, wrote 1048576 bytes. Next pos will be 14680064.
     diffTimeWriteOneBlock:         6.77s
     From pos 14680064, wrote 1048576 bytes. Next pos will be 15728640.
     diffTimeWriteOneBlock:         7.37s
     From pos 15728640, wrote 1048576 bytes. Next pos will be 16777216.
     diffTimeWriteOneBlock:         7.76s
     From pos 16777216, wrote 1048576 bytes. Next pos will be 17825792.
     diffTimeWriteOneBlock:         8.74s
     From pos 17825792, wrote 1048576 bytes. Next pos will be 18874368.
     diffTimeWriteOneBlock:         8.99s
     From pos 18874368, wrote 1048576 bytes. Next pos will be 19922944.
     diffTimeWriteOneBlock:         9.35s
     From pos 19922944, wrote 1048576 bytes. Next pos will be 20971520.
     diffTimeWriteOneBlock:         9.85s
     From pos 20971520, wrote 1048576 bytes. Next pos will be 22020096.
     diffTimeWriteOneBlock:        10.43s
     From pos 22020096, wrote 1048576 bytes. Next pos will be 23068672.
     diffTimeWriteOneBlock:        11.05s
     From pos 23068672, wrote 1048576 bytes. Next pos will be 24117248.
     diffTimeWriteOneBlock:        11.52s
     From pos 24117248, wrote 1048576 bytes. Next pos will be 25165824.
     diffTimeWriteOneBlock:        12.23s
     From pos 25165824, wrote 1048576 bytes. Next pos will be 26214400.
     diffTimeWriteOneBlock:        12.49s
     From pos 26214400, wrote 1048576 bytes. Next pos will be 27262976.
     diffTimeWriteOneBlock:         13.1s
     From pos 27262976, wrote 1048576 bytes. Next pos will be 28311552.
     diffTimeWriteOneBlock:        13.83s
     From pos 28311552, wrote 1048576 bytes. Next pos will be 29360128.
     diffTimeWriteOneBlock:        14.31s
     From pos 29360128, wrote 1048576 bytes. Next pos will be 30408704.
     diffTimeWriteOneBlock:        14.65s
     From pos 30408704, wrote 1048576 bytes. Next pos will be 31457280.
     diffTimeWriteOneBlock:        15.32s
     From pos 31457280, wrote 1048576 bytes. Next pos will be 32505856.
     diffTimeWriteOneBlock:        15.88s
     From pos 32505856, wrote 1048576 bytes. Next pos will be 33554432.
     diffTimeWriteOneBlock:        16.77s
     From pos 33554432, wrote 1048576 bytes. Next pos will be 34603008.
     diffTimeWriteOneBlock:         16.9s
     From pos 34603008, wrote 1048576 bytes. Next pos will be 35651584.
     diffTimeWriteOneBlock:        17.39s
     From pos 35651584, wrote 1048576 bytes. Next pos will be 36700160.
     diffTimeWriteOneBlock:        18.12s
     From pos 36700160, wrote 1048576 bytes. Next pos will be 37748736.
     diffTimeWriteOneBlock:        18.69s
     From pos 37748736, wrote 1048576 bytes. Next pos will be 38797312.
     diffTimeWriteOneBlock:        19.09s
     From 

Re: Finding the datanode ID

2010-08-07 Thread Ken Goodhope
You can use java.net.InetAddress.getLocalHost to determine what node
you are working on.
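A minimal sketch of that approach in a new-API mapper (the class name and output types are made up for illustration):

    import java.io.IOException;
    import java.net.InetAddress;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Tags every record with the hostname of the node that processed it.
    public class HostTaggingMapper extends Mapper<LongWritable, Text, Text, Text> {

      private Text hostname;

      @Override
      protected void setup(Context context) throws IOException, InterruptedException {
        // Resolve the local node's name once per task rather than once per record.
        hostname = new Text(InetAddress.getLocalHost().getHostName());
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        context.write(hostname, value);
      }
    }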

On Saturday, August 7, 2010, Denim Live denim.l...@yahoo.com wrote:
 Hi all,

 For some odd processing situation, I want to determine in each map and reduce
 task the node ID of the physical node on which a record is being processed. Is
 it possible to determine the node ID programmatically?





Re: adding new node to hadoop cluster

2010-07-22 Thread Ken Goodhope
The slaves file is only invoked by the start and stop scripts. If you
want to add a node to the cluster without restarting it, just ssh to the
node and use the hadoop-daemon.sh script. The options you want are
start datanode and start tasktracker.

On 7/22/10, Khaled BEN BAHRI khaled.ben_ba...@it-sudparis.eu wrote:
 Hi,

 I want to add a new data node to my hadoop cluster, but the problem is
 that when I add it to the configuration files, I have to restart the
 Hadoop daemons for the cluster to know that a new data node has been added.

 Please, does anyone know how to add new nodes without needing to
 restart hadoop?

 Thanks in advance for the help

 Best regards
 khaled




Re: Task tracker and Data node not stopping

2010-07-15 Thread Ken Goodhope
Inside hadoop-env.sh, you will see a property that sets the directory for
pids to be written to.  Check which directory it is and then investigate
the possibility that some other process is deleting or overwriting those
files.  If you are using NFS, with all nodes pointing at the same directory,
then it might be a matter of each node overwriting the same file.

Either way, the stop scripts look for those pid files and use them to stop
the correct daemon.  If they are not found, or if a file contains the
wrong pid, the script will report that there is no process to stop.

On Thu, Jul 15, 2010 at 4:51 AM, Karthik Kumar karthik84ku...@gmail.comwrote:

 Hi,

  I am using a cluster of two machines, one master and one slave. When I
 try to stop the cluster using stop-all.sh, it displays the output below; the
 tasktracker and datanode on the slave are also not stopped. Please help me
 in solving this.

 stopping jobtracker
 160.110.150.29: no tasktracker to stop
 stopping namenode
 160.110.150.29: no datanode to stop
 localhost: stopping secondarynamenode


 --
 With Regards,
 Karthik



Re: Problem with socket timeouts

2010-07-15 Thread Ken Goodhope
What is your file descriptor limit?  If it is still at the default of 1024,
you will want to up that considerably.  Don't be afraid to go to 64k.

On Thu, Jul 15, 2010 at 3:38 AM, Peter Falk pe...@bugsoft.nu wrote:

 Hi,

 We are noticing the following errors in our datanode logs. We are running
 hbase on top of hdfs, but are not noticing any errors in the hbase logs. So
 it seems like the hdfs clients are not suffering from these errors.
 However, it would be nice to understand why they appear. We have upped the
 number of xcievers to 1024, and we suspect that there may be too many sockets
 and that some of them are timing out because they are not being used. Also, we
 have set dfs.datanode.socket.write.timeout and dfs.socket.timeout to about 10
 minutes. Does anyone know why these errors appear, and perhaps how to get rid
 of them?

 2010-07-15 12:31:02,307 WARN
 org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
 192.168.10.54:50010,
 storageID=DS-1341800526-192.168.10.54-50010-1278528499852, infoPort=50075,
 ipcPort=50020):Got exception while serving blk_3305049902326993023_98358 to
 /192.168.10.54:
 java.net.SocketTimeoutException: 60 millis timeout while waiting for
 channel to be ready for write. ch :
 java.nio.channels.SocketChannel[connected local=/192.168.10.54:50010
 remote=/
 192.168.10.54:46276]
         at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
         at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
         at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
         at java.lang.Thread.run(Thread.java:619)

 2010-07-15 12:31:02,308 ERROR
 org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
 192.168.10.54:50010,
 storageID=DS-1341800526-192.168.10.54-50010-1278528499852, infoPort=50075,
 ipcPort=50020):DataXceiver
 java.net.SocketTimeoutException: 60 millis timeout while waiting for
 channel to be ready for write. ch :
 java.nio.channels.SocketChannel[connected local=/192.168.10.54:50010
 remote=/
 192.168.10.54:46276]
         at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
         at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
         at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
         at java.lang.Thread.run(Thread.java:619)

 TIA,
 Peter



Re: using 'fs -put' from datanode: all data written to that node's hdfs and not distributed

2010-07-12 Thread Ken Goodhope
All writes from a datanode leave one copy on the local node, one copy
on another node in the same rack, and a third on another rack if
available.

On 7/12/10, Nathan Grice ngr...@gmail.com wrote:
 We are trying to load data into hdfs from one of the slaves, and when the put
 command is run from a slave (datanode), all of the blocks are written to that
 datanode's hdfs and not distributed to all of the nodes in the cluster. It
 does not seem to matter what destination format we use (/filename vs
 hdfs://master:9000/filename); it always behaves the same.
 Conversely, running the same command from the namenode distributes the files
 across the datanodes.

 Is there something I am missing?

 -Nathan



Re: How to control the number of map tasks for each nodes?

2010-07-08 Thread Ken Goodhope
If you want to have a different number of tasks for different nodes, you
will need to look at one of the more advanced schedulers.  FairScheduler and
CapacityScheduler are the most common.  FairScheduler has extensibility
points where you can add your own logic for deciding if a particular node
can schedule another task.  I believe CapacityScheduler does this too, but I
haven't used it as much.

On Thu, Jul 8, 2010 at 6:49 AM, Jones, Nick nick.jo...@amd.com wrote:

 Vitaliy/Edward,
 One thing to keep in mind is that overcommitting the number of cores can
 lead to map timeouts unless the map task submits progress updates to the
 jobtracker.  I found that out the hard way with a few computationally
 expensive maps.

 Nick Jones

 -Original Message-
 From: Vitaliy Semochkin [mailto:vitaliy...@gmail.com]
 Sent: Thursday, July 08, 2010 5:15 AM
 To: common-user@hadoop.apache.org
 Subject: Re: How to control the number of map tasks for each nodes?

 Hi,

 in mapred-site.xml you should place

 <property>
   <name>mapred.tasktracker.map.tasks.maximum</name>
   <value>8</value>
   <description>the number of available cores on the tasktracker machines
   for map tasks</description>
 </property>
 <property>
   <name>mapred.tasktracker.reduce.tasks.maximum</name>
   <value>8</value>
   <description>the number of available cores on the tasktracker machines
   for reduce tasks</description>
 </property>

 where 8 is the number of your CORES, not CPUs; if you have 8 dual-core
 processors, put 16 there.
 I found that having the number of map tasks a bit bigger than the number of
 cores is better, because sometimes hadoop waits for IO operations and the
 tasks do nothing.

 Regards,
 Vitaliy S

 On Thu, Jul 8, 2010 at 1:07 PM, edward choi mp2...@gmail.com wrote:

  Hi,
 
  I have a cluster consisting of 11 slaves and a single master.
 
  The thing is that 3 of my slaves have i7 CPUs, which means that they can run
  up to 8 simultaneous processes.
  But the other slaves only have dual-core CPUs.
 
  So I was wondering if I can specify the number of map tasks for each of my
  slaves.
  For example, I want to give 8 map tasks to the slaves that have i7 CPUs and
  only two map tasks to the others.
 
  Is there a way to do this?
 




Re: How to access Reporter in new API?

2010-07-08 Thread Ken Goodhope
The reporter and the outputcollector have all been rolled up into the
context object in the new API.

http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Mapper.Context.html

http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Reducer.Context.html
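As a sketch (not from the original message), the old Reporter and OutputCollector calls map onto the context roughly like this; the counter group and names are made up:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ReportingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        context.setStatus("processing " + key);                   // was Reporter.setStatus
        context.getCounter("MyApp", "RecordsSeen").increment(1);  // was Reporter.incrCounter
        context.progress();                                       // was Reporter.progress
        context.write(value, new IntWritable(1));                 // was OutputCollector.collect
      }
    }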


On Thu, Jul 8, 2010 at 3:40 AM, Vitaliy Semochkin vitaliy...@gmail.comwrote:

 Hi,

 I'm using the Mapper from the new Hadoop API.
 How do I access the Reporter instance in the new API?


 PS If someone knows any article on logging and problem reporting in Hadoop
 please post a link here.

 Thanks in Advance,
 Vitaliy S



Re: how to add jar file to hadoop

2010-07-07 Thread Ken Goodhope
You have a couple of choices. You can either add your jar file to a
directory already on the classpath, like hadoop/lib, or you can add
your jar file to the classpath set inside hadoop-env.sh, located in the
conf directory.

Either way, you will need to restart mapred after doing so. After you
do that, your job should see the classes you have added.

On 7/7/10, Ahmad Shahzad ashahz...@gmail.com wrote:
 I want to add some utility to hadoop. For that I wanted to know how I can
 add the jar file to the hadoop directory so that I can access files in the
 jar from java files in the hadoop directory (for example TaskTracker.java). I
 am not writing any map-reduce program.

 Regards,
 Ahmad Shahzad



Re: decomission a node

2010-07-06 Thread Ken Goodhope
Inside the hdfs conf,

   <property>
     <name>dfs.hosts.exclude</name>
     <value></value>
     <description>Names a file that contains a list of hosts that are
     not permitted to connect to the namenode.  The full pathname of the
     file must be specified.  If the value is empty, no hosts are
     excluded.</description>
   </property>

Point this property at a file containing a list of nodes you want to
decommission.  From there, use the command line hadoop dfsadmin
-refreshNodes.


On Tue, Jul 6, 2010 at 7:31 AM, Some Body someb...@squareplanet.de wrote:

 Hi,

 Is it possible to move all the data blocks off a cluster node and then
 decommission the node?

 I'm asking because, now that my MR job is working, I'd like to see how things
 scale, i.e., fewer processing nodes, different amounts of data (number and
 size of files, etc.). I currently have 8 nodes
 and am processing 5GB spread across 2000 files.

 Alan



Re: why my Reduce Class does not work?

2010-07-04 Thread Ken Goodhope
You need @Override on your reduce method. Right now you are getting
the identity reduce method.
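For reference, a minimal sketch (not part of the original reply) of the signature the new-API Reducer expects, using the types from the quoted code below; with @Override in place, a mismatched signature becomes a compile error instead of silently falling back to the identity reduce. Imports are the same as in the quoted program.

    public static class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
          sum += value.get();
        }
        context.write(key, new IntWritable(sum));
      }
    }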

On 7/4/10, Vitaliy Semochkin vitaliy...@gmail.com wrote:
 Hi,

 I rewrote the WordCount sample to use the new Hadoop API,

 however my reduce task doesn't launch.

 The result file always looks like:
 some_word 1
 some_word 1
 another_word 1
 another_word 1

 ...

 Here is the code:

 import java.io.IOException;
 import java.util.StringTokenizer;

 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.Reducer;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 public class WordCount {

 public static class WordCountMapper extends Mapper<LongWritable, Text, Text,
 IntWritable> {

 @Override
 protected void map(LongWritable key, Text value, Context context) throws
 IOException, InterruptedException {
 StringTokenizer st = new StringTokenizer(value.toString());
 while (st.hasMoreTokens()) {
 context.write(new Text(st.nextToken()), new IntWritable(1));
 }
 }
 }

 public static class WordCountReduce extends Reducer<Text, IntWritable, Text,
 IntWritable> {

 @SuppressWarnings("unchecked")
 public void reduce(Text key, Iterable<IntWritable> values, Reducer.Context
 context) throws IOException, InterruptedException {
 int sum = 0;
 for (IntWritable value : values) {
 sum += value.get();
 }
 context.write(key, new IntWritable(sum));
 }
 }

 public static void main(String[] args) throws IOException,
 InterruptedException, ClassNotFoundException {
 Job job = new Job();
 job.setJobName("WordCounter");
 job.setJarByClass(WordCount.class);
 job.setMapperClass(WordCountMapper.class);
 job.setReducerClass(WordCountReduce.class);
 job.setOutputKeyClass(Text.class);
 job.setOutputValueClass(IntWritable.class);
 FileInputFormat.setInputPaths(job, new Path(args[0]));
 FileOutputFormat.setOutputPath(job, new Path(args[1]));
 System.exit(job.waitForCompletion(true)? 0 :1);
 }
 }

 It looks like WordCountReduce was never launched, but I don't see any warnings
 or errors in the log file.

 Any help is highly appreciated.

 Thanks in Advance,
 Vitaliy S



Re: Intermediate files generated.

2010-07-02 Thread Ken Goodhope
You could also use MultipleOutputs from the old API.  This will allow you to
create multiple output collectors.  One collector could be used at
the beginning of the reduce call for writing the key-value pairs
unaltered, and another collector for writing the results of your processing.
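A rough sketch of that approach with the old-API MultipleOutputs (not from the original thread; the class name, the named output "raw", and the toUpperCase placeholder are invented for illustration):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.MultipleOutputs;

    // One collector writes the incoming key-value pairs unaltered ("raw"),
    // the normal output collector writes the processed results.
    public class TeeReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {

      private MultipleOutputs mos;

      public void configure(JobConf job) {
        mos = new MultipleOutputs(job);
      }

      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        while (values.hasNext()) {
          Text value = values.next();
          mos.getCollector("raw", reporter).collect(key, value);          // unaltered copy
          output.collect(key, new Text(value.toString().toUpperCase()));  // placeholder processing
        }
      }

      public void close() throws IOException {
        mos.close();
      }
    }

The driver would declare the named output up front, along the lines of
MultipleOutputs.addNamedOutput(conf, "raw", TextOutputFormat.class, Text.class, Text.class).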

On Fri, Jul 2, 2010 at 5:17 AM, Pramy Bhats pramybh...@googlemail.comwrote:

 Hi,

 Isn't it possible to hook into the intermediate files generated?

 I am writing a compilation framework, so I don't want to mess with the
 existing programming framework. The upper layer, or the programmer, should
 write the program the way they normally would, and I want to leverage the
 intermediate files generated for my analysis.

 thanks,
 --PB.

 On Fri, Jul 2, 2010 at 1:05 PM, Jones, Nick nick.jo...@amd.com wrote:

  Hi Pramy,
  I would set up one M/R job to just map (setNumReducers=0) and chain another
  job that uses a unity mapper to pass the intermediate data to the reduce
  step.
 
  Nick
  Sent by radiation.
 
  - Original Message -
  From: Pramy Bhats pramybh...@googlemail.com
  To: common-user@hadoop.apache.org common-user@hadoop.apache.org
  Sent: Fri Jul 02 01:05:25 2010
  Subject: Re: Intermediate files generated.
 
  Hi Hemanth,
 
  I need to use the output of the mapper for some other application. As a
  result, if I can redirect the output of the map into temp files of my choice
  (which are stored on hdfs), then I can reuse the output later. At the same
  time, the succeeding reducer can read its input from these temp files without
  any overhead.
 
  thanks,
  --PB
 
  On Fri, Jul 2, 2010 at 3:52 AM, Hemanth Yamijala yhema...@gmail.com
  wrote:
 
   Alex,
  
 I don't think this is what I am looking for. Essentially, I wish to run both
 the mapper as well as the reducer. But at the same time, I wish to make sure
 that the temp files that are used between mappers and reducers are of my
 choice. Here, the choice means that I can specify the files in HDFS that can
 be used as temp files.
  
   Could you explain why you want to do this ?
  
   
thanks,
--PB.
   
 On Fri, Jul 2, 2010 at 12:14 AM, Alex Loddengaard a...@cloudera.com wrote:

 You could use the HDFS API from within your mapper, and run with 0 reducers.

 Alex
   
 On Thu, Jul 1, 2010 at 3:07 PM, Pramy Bhats pramybh...@googlemail.com wrote:
   
 Hi,

 I am using the hadoop framework for writing MapReduce jobs. I want to redirect
 the output of Map into files of my choice and later use those files as input
 for the Reduce phase.

 Could you please suggest how to proceed with it?

 thanks,
 --PB.

   
   
  
 
 



Re: Dynamically set mapred.tasktracker.map.tasks.maximum from inside a job.

2010-06-30 Thread Ken Goodhope
What you want to do can be accomplished in the scheduler. Take a look
at the fair scheduler, specifically the user extensible options. There
you will find the ability to add some extra logic for deciding if a
task can be launched on a per job basis. Could be as simple as
deciding a particular job can't launch more than 12 tasks at a time.

Capacity scheduler might be able to do this too, but I'm not sure.

On Wednesday, June 30, 2010, Pierre ANCELOT pierre...@gmail.com wrote:
 ok, well, thanks...
 I truly hoped a solution would exist for this.
 Thanks.

 Pierre.

 On Wed, Jun 30, 2010 at 3:56 PM, Yu Li car...@gmail.com wrote:

 Hi Pierre,

 The setNumReduceTasks method is for setting the number of reduce tasks to
 launch; it's equivalent to setting the mapred.reduce.tasks parameter, while the
 mapred.tasktracker.reduce.tasks.maximum parameter decides the number of
 tasks running *concurrently* on one node.
 And as Amareshwari mentioned, mapred.tasktracker.map/reduce.tasks.maximum is
 a cluster configuration which cannot be set per job. If you set
 mapred.tasktracker.map.tasks.maximum to 20, and the overall number of map
 tasks is larger than 20 times the number of nodes, there would be 20 map tasks
 running concurrently on a node. As far as I know, you probably need to restart
 the tasktracker if you truly need to change the configuration.

 Best Regards,
 Carp

 2010/6/30 Pierre ANCELOT pierre...@gmail.com

  Sure, but not the number of tasks running concurrently on a node at the same
  time.
 
 
 
  On Wed, Jun 30, 2010 at 1:57 PM, Ted Yu yuzhih...@gmail.com wrote:
 
   The number of map tasks is determined by InputSplit.
  
   On Wednesday, June 30, 2010, Pierre ANCELOT pierre...@gmail.com wrote:
Hi,
 Okay, so, if I set the 20 by default, I could maybe limit the number of
 concurrent maps per node instead?
 job.setNumReduceTasks exists but I see no equivalent for maps, though I
 think there was a setNumMapTasks before...
 Was it removed? Why?
 Any idea about how to achieve this?
   
Thank you.
   
   
 On Wed, Jun 30, 2010 at 12:08 PM, Amareshwari Sri Ramadasu amar...@yahoo-inc.com wrote:
   
Hi Pierre,
   
 mapred.tasktracker.map.tasks.maximum is a cluster-level configuration; it
 cannot be set per job. It is loaded only while bringing up the TaskTracker.
   
Thanks
Amareshwari
   
On 6/30/10 3:05 PM, Pierre ANCELOT pierre...@gmail.com wrote:
   
Hi everyone :)
 There's something I'm probably doing wrong but I can't seem to figure out what.
 I have two hadoop programs running one after the other.
 This is done because they don't have the same needs in terms of processor and
 memory, so by separating them I optimize each task better.
 Fact is, I need mapred.tasktracker.map.tasks.maximum set to 12 on every node
 for the first job.
 For the second task, I need it to be set to 20.
 So by default I set it to 12, and in the second job's code, I set this:

     Configuration hadoopConfiguration = new Configuration();
     hadoopConfiguration.setInt("mapred.tasktracker.map.tasks.maximum", 20);

 But when running the job, instead of having the 20 tasks on each node as
 expected, I have 12.
 Any idea please?
   
Thank you.
Pierre.
   
  --
 http://www.neko-consulting.com
 Ego sum quis ego servo
 Je suis ce que je protège
 I am what I protect



Re: newbie - job failing at reduce

2010-06-30 Thread Ken Goodhope
Have you increased your file handle limits?  You can check this with a
'ulimit -n' call.   If you are still at 1024, then you will want to increase
the limit to something quite a bit higher.

On Wed, Jun 30, 2010 at 9:40 AM, Siddharth Karandikar 
siddharth.karandi...@gmail.com wrote:

 Yeah. SSH is working as mentioned in the docs. Even directory
 mentioned for 'mapred.local.dir' has enough space.

 - Siddharth

 On Wed, Jun 30, 2010 at 10:01 PM, Chris Collord ccoll...@lanl.gov wrote:
  Interesting that the reduce phase makes it that far before failing!
  Are you able to SSH (without a password) into the failing node?  Any
  possible folder permissions issues?
  ~Chris
 
  On 06/30/2010 10:26 AM, Siddharth Karandikar wrote:
 
  Hey Chris,
  Thanks for your inputs. I have tried most of the stuff, but will
  surely go through the tutorial you have pointed out. Maybe I will get some
  hint there.
 
  Interestingly, while experimenting with it more, I noticed that if the
  input file is small (50 MBs), the job works perfectly fine.
  If I give it bigger input, it starts hanging at the reduce tasks. The map
  phase always finishes 100%.
 
  - Siddharth
 
 
  On Wed, Jun 30, 2010 at 9:11 PM, Chris Collord ccoll...@lanl.gov wrote:
 
 
  Hi Siddharth,
  I'm VERY new to this myself, but here are a few thoughts (since nobody else
  is responding!).
  -You might want to set dfs.replication to 2.  I have read that for clusters
  of fewer than 8 nodes you should have replication set to 2 machines; 8+ node
  clusters use 3.  This may make your cluster work, but it won't fix your
  problem.
  -Run a bin/hadoop dfsadmin -report with the hadoop cluster running and see
  what it shows for your failing node.
  -Check your logs/ folder for datanode logs and see if there's anything
  useful in there before the error you're getting.
  -You might try reformatting your hdfs, if you don't have anything important
  in there: bin/hadoop namenode -format.  (Note: this has caused problems
  for me in the past with namenode IDs; see the bottom of the link for
  Michael Noll's tutorial if that happens.)
 
  You should check out Michael Noll's tutorial for all the little
 details:
 
 
 http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29
 
  Let me know if anything helps!
  ~Chris
 
 
 
  On 06/30/2010 04:02 AM, Siddharth Karandikar wrote:
 
 
  Anyone?
 
 
  On Tue, Jun 29, 2010 at 8:41 PM, Siddharth Karandikar
  siddharth.karandi...@gmail.comwrote:
 
 
 
  Hi All,
 
  I am new to Hadoop, but by reading online docs and other resources, I
  have moved ahead and am now trying to run a cluster of 3 nodes.
  Before doing this, I tried my program on standalone and pseudo-distributed
  setups, and that's working fine.
 
  Now the issue that I am facing: the mapping phase works correctly. While
  doing the reduce, I am seeing the following error on one of the nodes -
 
  2010-06-29 14:35:01,848 WARN org.apache.hadoop.mapred.TaskTracker:
  getMapOutput(attempt_201006291958_0001_m_08_0,0) failed :
  org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
  taskTracker/jobcache/job_201006291958_0001/attempt_201006291958_0001_m_08_0/output/file.out.index
  in any of the configured local directories
 
  Let's say this is on Node1. But there is no such directory named
  'taskTracker/jobcache/job_201006291958_0001/attempt_201006291958_0001_m_08_0'
  under /tmp/mapred/local/taskTracker/ on Node1. Interestingly, this
  directory is available on Node2 (or Node3). I tried running the job
  multiple times, but it always fails while reducing. Same error.
 
  I have configured /tmp/mapred/local on each node in mapred-site.xml.
 
  I really don't understand why the mappers are misplacing these files. Or
  am I missing something in the configuration?
 
  If someone wants to look at the configurations, I have pasted them below.
 
  Thanks,
  Siddharth
 
 
  Configurations
  ==
 
  conf/core-site.xml
  ---
 
  <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://192.168.2.115/</value>
    </property>
  </configuration>


  conf/hdfs-site.xml
  --
  <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://192.168.2.115</value>
    </property>
    <property>
      <name>dfs.data.dir</name>
      <value>/home/siddharth/hdfs/data</value>
    </property>
    <property>
      <name>dfs.name.dir</name>
      <value>/home/siddharth/hdfs/name</value>
    </property>
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>
  </configuration>

  conf/mapred-site.xml
  --
  <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>192.168.2.115:8021</value>
    </property>
    <property>
      <name>mapred.local.dir</name>
 

Re: copy files to HDFS protocols

2010-06-30 Thread Ken Goodhope
Also, hadoop fs -put can read from stdin by specifying '-' as your input
file.
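If you need to do the same thing programmatically (Jeff's second point below), a minimal sketch using the FileSystem API; the target path is just an example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    // Streams whatever arrives on stdin straight into an HDFS file.
    public class StdinToHdfs {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up core-site.xml, so fs.default.name points at HDFS
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/user/example/streamed.dat"));
        IOUtils.copyBytes(System.in, out, 4096, true);  // 'true' closes both streams when the copy finishes
      }
    }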

On Wed, Jun 30, 2010 at 2:19 AM, Jeff Zhang zjf...@gmail.com wrote:

 1. You can mount the files on the other machine onto your local machine and
 invoke copyFromLocal.
 2. Sure, you can write a stream to HDFS, but I'm afraid you have to use the
 Java API of hadoop; refer to FileSystem.java in hadoop.



 On Wed, Jun 30, 2010 at 4:59 PM, Oleg Ruchovets oruchov...@gmail.com
 wrote:

  Hi,
 
  I use the HDFS shell to copy files from the local FileSystem to Hadoop HDFS
  (the copyFromLocal command).
 
  1) How can I provide a path to a file which is located on a local FS but on a
  different machine than the one hadoop runs on?
  2) What other protocols (ways) can I use to write files to HDFS?  Is it
  possible to use streaming data to write to HDFS?
 
  Thanks in advance,
  Oleg.
 



 --
 Best Regards

 Jeff Zhang



Re: do all mappers finish before reducer starts

2010-02-01 Thread Ken Goodhope
The reduce function is always called after all map tasks are complete.  This
is not to be confused with the reduce task.  The reduce task can be
launched and begin copying data as soon as the first mapper completes.  By
default though, reduce tasks are not launched until 5% of the mappers are
completed.
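That 5% threshold is a job parameter; a minimal sketch of changing it (assuming the classic parameter name mapred.reduce.slowstart.completed.maps — the 0.80 value is only an example):

    import org.apache.hadoop.conf.Configuration;

    public class SlowstartSettings {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Don't launch reduce tasks until 80% of the map tasks have finished
        // (the default is 0.05, i.e. 5%).
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.80f);
        // Hand this conf to the Job/JobConf that submits the MapReduce job.
      }
    }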

2010/2/1 Jeyendran Balakrishnan jbalakrish...@docomolabs-usa.com

 Correct me if I'm wrong, but this:

  Yes, any reduce function call should be after all the mappers have done
  their work.

 is strictly true only if speculative execution is explicitly turned off.
 Otherwise there is a chance that some reduce tasks can actually start before
 all the maps are complete. In case it turns out that some map output key
 used by one speculative reduce task is output by some other map after this
 reduce task has started, I think the JT then kills this speculative task.



 -Original Message-
 From: Gang Luo [mailto:lgpub...@yahoo.com.cn]
 Sent: Friday, January 29, 2010 2:27 PM
 To: common-user@hadoop.apache.org
 Subject: Re: do all mappers finish before reducer starts

 It seems this is a hot issue.

 When any mapper finishes (its sorted intermediate result is on local disk),
 the shuffle starts to transfer the result to the corresponding reducers, even
 while other mappers are still working.  Since the shuffle is part of the reduce
 phase, the map phase and reduce phase can be seen as overlapping to some
 extent.  That is why you see such a progress report.

 What you actually mentioned is the reduce function. Yes, any reduce
 function call should be after all the mappers have done their work.

  -Gang


 - Original Message -
 From: adeelmahmood adeelmahm...@gmail.com
 To: core-u...@hadoop.apache.org
 Sent: 2010/1/29 (Friday) 4:10:50 PM
 Subject: do all mappers finish before reducer starts


 I just have a conceptual question. My understanding is that all the mappers
 have to complete their job before the reducers can start working, because
 mappers don't know about each other, so we need the values for a given key
 from all the different mappers; we have to wait until all mappers have
 collectively given the system all possible values for a key, so that they can
 then be passed on to the reducer.
 But when I ran these jobs, almost every time the reducers start working before
 the mappers are all done, so it would say map 60% reduce 30%. How does this
 work?
 Does it find all possible values for a single key from all mappers, pass that
 on to the reducer, and then work on other keys?
 Any help is appreciated.
 --
 View this message in context:
 http://old.nabble.com/do-all-mappers-finish-before-reducer-starts-tp27330927p27330927.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.





Re: Could not obtain block

2010-01-29 Thread Ken Goodhope
Could not obtain block errors are often caused by running out of available
file handles.  You can confirm this by going to the shell and entering
ulimit -n.  If it says 1024, the default, then you will want to increase
it to about 64,000.

On Fri, Jan 29, 2010 at 4:06 PM, MilleBii mille...@gmail.com wrote:

 X-POST with Nutch mailing list.

 HEEELP !!!

 Kind of get stuck on this one.
 I backed-up my hdfs data, reformated the hdfs, put data back, try to merge
 my segments together and it explodes again.

 Exception in thread Lucene Merge Thread #0
 org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException:
 Could not obtain block: blk_4670839132945043210_1585
 file=/user/nutch/crawl/indexed-segments/20100113003609/part-0/_ym.frq
at

 org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:309)

 If I go into the hdfs/data directory I DO find the faulty block.
 Could it be a synchronization problem in the segment merger code?

 2010/1/29 MilleBii mille...@gmail.com

  I'm looking for some help. I'm a Nutch user; everything was working fine, but
  now I get the following error when indexing.
  I have a single-node pseudo-distributed setup.
  Some people on the Nutch list suggested that I could be full, so I removed
  many things, and hdfs is far from full.
  This file/directory was perfectly OK the day before.
  I did a hadoop fsck... the report says healthy.
 
  What can I do?
 
  Is it safe to do a Linux FSCK just in case?
 
  Caused by: java.io.IOException: Could not obtain block:
  blk_8851198258748412820_9031
 
 file=/user/nutch/crawl/indexed-segments/20100111233601/part-0/_103.frq
 
 
  --
  -MilleBii-
 



 --
 -MilleBii-




-- 
Ken Goodhope
Cell: 425-750-5616

362 Bellevue Way NE Apt N415
Bellevue WA, 98004