Map and Reduce numbers are not restricted by setNumMapTasks and setNumReduceTasks, JobConf related?

2008-10-06 Thread Andy Li
Dears,

Sorry, I did not mean to cross-post, but the previous article was
accidentally posted to the HBase user list.  I would like to bring it back
to the Hadoop user list since it is confusing me a lot and it is mainly
MapReduce related.

I am currently running hadoop-0.18.1 on 25 nodes.  The Map and Reduce Task
Capacity is 92.  When I do this in my MapReduce program:

= SAMPLE CODE =
JobConf jconf = new JobConf(conf, TestTask.class);
jconf.setJobName("my.test.TestTask");
jconf.setOutputKeyClass(Text.class);
jconf.setOutputValueClass(Text.class);
jconf.setOutputFormat(TextOutputFormat.class);
jconf.setMapperClass(MyMapper.class);
jconf.setCombinerClass(MyReducer.class);
jconf.setReducerClass(MyReducer.class);
jconf.setInputFormat(TextInputFormat.class);
try {
    jconf.setNumMapTasks(5);
    jconf.setNumReduceTasks(3);
    JobClient.runJob(jconf);
} catch (Exception e) {
    e.printStackTrace();
}
= = =

When I run the job, I always get 300 mappers and 1 reducer on the
JobTracker web page running on the default port 50030.
No matter what numbers I pass to setNumMapTasks and setNumReduceTasks,
I get the same result.
Does anyone know why this is happening?
Am I missing or misunderstanding something in the picture?  =(

Here is a reference to the parameters we have overridden in hadoop-site.xml.
===
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>

All other parameters are the defaults from hadoop-default.xml.

Any idea why this is happening?

Any input is appreciated.

Thanks,
-Andy
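
In the classic mapred API the value passed to setNumMapTasks() is only a
hint: the actual number of map tasks is decided by the InputFormat, and
TextInputFormat creates roughly one split per HDFS block of input, which is
likely where the 300 mappers come from.  The reduce count set via
setNumReduceTasks() is normally taken literally, so if it is ignored,
something else may be overriding mapred.reduce.tasks in the client
configuration.  Below is a minimal sketch of influencing the split (and
therefore mapper) count, assuming the same TestTask/MyMapper/MyReducer
classes from the post and the 0.18-era mapred.min.split.size property;
treat the sizes as illustrative only.

JobConf jconf = new JobConf(conf, TestTask.class);

// Mapper count follows the input splits, not setNumMapTasks().
// Raising the minimum split size shrinks the number of splits (and mappers).
jconf.setLong("mapred.min.split.size", 512L * 1024 * 1024);  // ~512 MB per split
jconf.setNumMapTasks(5);        // only a hint to the InputFormat

// Reducer count is read directly from the job configuration.
jconf.setNumReduceTasks(3);

JobClient.runJob(jconf);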


Re: [some bugs] Re: file permission problem

2008-03-14 Thread Andy Li
I think this is the same problem related to this mail thread.

http://www.mail-archive.com/[EMAIL PROTECTED]/msg02759.html

A JIRA has been filed, please see HADOOP-2915.

On Fri, Mar 14, 2008 at 2:08 AM, Stefan Groschupf [EMAIL PROTECTED] wrote:

 Hi,
 any magic we can do with hadoop.dfs.umask? Or is there any other off
 switch for the file security?
 Thanks.
 Stefan
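
 As far as I know, the off switch is dfs.permissions rather than the umask:
 setting it to false on the namenode disables permission checking entirely,
 while the umask setting only changes the mode bits of newly created files.
 A hedged hadoop-site.xml sketch for an 0.16-era namenode:

 <property>
   <name>dfs.permissions</name>
   <value>false</value>
   <description>If false, permission checking is turned off; ownership and
   modes are still recorded but not enforced.</description>
 </property>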
 On Mar 13, 2008, at 11:26 PM, Stefan Groschupf wrote:

  Hi Nicholas, Hi All,
 
  I definitely can reproduce the problem Johannes describes.
  Also from debugging through the code it is clearly a bug from my
  point of view.
  So this is the call stack:
  SequenceFile.createWriter
  FileSystem.create
  DFSClient.create
  namenode.create
  In NameNode I found this:
    namesystem.startFile(src,
        new PermissionStatus(Server.getUserInfo().getUserName(), null, masked),
        clientName, clientMachine, overwrite, replication, blockSize);
 
  In getUserInfo is this comment:
    // This is to support local calls (as opposed to rpc ones) to the name-node.
    // Currently it is name-node specific and should be placed somewhere else.
    try {
      return UnixUserGroupInformation.login();
  The login javaDoc says:
    /**
     * Get current user's name and the names of all its groups from Unix.
     * It's assumed that there is only one UGI per user. If this user already
     * has a UGI in the ugi map, return the ugi in the map.
     * Otherwise get the current user's information from Unix, store it
     * in the map, and return it.
     */
 
  Besides that, I had some interesting observations.
  If I have permission to write to a folder A, I can delete folder A
  and a file B inside folder A even if I have no
  permissions on B.
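
  This matches POSIX semantics, which the HDFS permission model follows:
  deleting an entry needs write permission on the containing directory, not
  on the file itself.  A small local-filesystem sketch (plain java.io,
  hypothetical /tmp paths, nothing HDFS-specific) showing the same rule:

  import java.io.File;

  public class DeleteInsideWritableDir {
      public static void main(String[] args) throws Exception {
          File dir = new File("/tmp/folderA");       // writable by us
          dir.mkdirs();
          File inner = new File(dir, "fileB");
          inner.createNewFile();
          inner.setReadOnly();                       // drop our write permission on B

          // On a POSIX filesystem this still succeeds, because unlinking only
          // requires write permission on the parent directory (folderA).
          System.out.println("deleted fileB: " + inner.delete());
      }
  }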
 
  Also I noticed the following in my dfs:
  [EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/myApp-1205474968598
  Found 1 items
  /user/joa23/myApp-1205474968598/VOICE_CALL  <dir>  2008-03-13 16:00  rwxr-xr-x  hadoop  supergroup
  [EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/myApp-1205474968598/VOICE_CALL
  Found 1 items
  /user/joa23/myApp-1205474968598/VOICE_CALL/part-0  <r 3>  27311  2008-03-13 16:00  rw-r--r--  joa23  supergroup
 
  Am I missing something, or was I able to write as user joa23 into a
  folder owned by hadoop where I should have no permissions? :-O
  Should I open some JIRA issues?
 
  Stefan
 
 
 
 
 
  On Mar 13, 2008, at 10:55 AM, [EMAIL PROTECTED] wrote:
 
  Hi Johannes,
 
  I'm using the 0.16.0 distribution.
  I assume you mean the 0.16.0 release (
 http://hadoop.apache.org/core/releases.html
  ) without any additional patch.
 
  I have just tried it but cannot reproduce the problem you
  described.  I did the following:
  1) start a cluster with tsz
  2) run a job with nicholas
 
  The output directory and files are owned by nicholas.  Am I doing
  the same thing you did?  Could you try again?
 
  Nicholas
 
 
  - Original Message 
  From: Johannes Zillmann [EMAIL PROTECTED]
  To: core-user@hadoop.apache.org
  Sent: Wednesday, March 12, 2008 5:47:27 PM
  Subject: file permission problem
 
  Hi,

  I have a question regarding file permissions.
  I have a kind of workflow where I submit a job from my laptop to a
  remote Hadoop cluster.
  After the job finishes I do some file operations on the generated
  output.
  The cluster user is different from the laptop user.  As output I
  specify a directory inside the user's home.  This output directory,
  created by the map-reduce job, has cluster-user permissions, so
  it does not allow me to move or delete the output folder with my
  laptop user.
 
  So it looks as follows:
  /user/jz/         rwxrwxrwx   jz       supergroup
  /user/jz/output   rwxr-xr-x   hadoop   supergroup
 
  I tried different things to achieve what I want (moving/deleting the
  output folder):
  - jobConf.setUser("hadoop") on the client side
  - System.setProperty("user.name", "hadoop") before JobConf instantiation
    on the client side
  - adding a user.name property to hadoop-site.xml on the client side
  - setPermission(777) on the home folder from the client side (does not
    work recursively; see the sketch after this message)
  - setPermission(777) on the output folder from the client side
    (permission denied)
  - creating the output folder before running the job ("Output directory
    already exists" exception)
 
  None of the things I tried worked.  Is there a way to achieve what
  I want?
  Any ideas appreciated!
 
  cheers
  Johannes
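
  Since FileSystem.setPermission is not recursive, one workaround is to have
  the cluster user that owns the output walk the tree and chmod every entry.
  A hedged sketch, assuming a FileSystem API with listStatus/setPermission
  (present in the 0.16-0.18 era) and the /user/jz/output path from above:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.fs.permission.FsPermission;

  public class RecursiveChmod {
      // Recursively apply mode 777 so another user can move/delete the tree.
      static void chmodAll(FileSystem fs, Path p) throws Exception {
          fs.setPermission(p, new FsPermission((short) 0777));
          if (fs.getFileStatus(p).isDir()) {
              for (FileStatus child : fs.listStatus(p)) {
                  chmodAll(fs, child.getPath());
              }
          }
      }

      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());
          chmodAll(fs, new Path("/user/jz/output"));
      }
  }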
 
 
 
 
 
  --
  ~~~
  101tec GmbH
 
  Halle (Saale), Saxony-Anhalt, Germany
  http://www.101tec.com
 
 
 
 
 
  ~~~
  101tec Inc.
  Menlo Park, California, USA
  http://www.101tec.com
 
 
 






Re: long write operations and data recovery

2008-02-29 Thread Andy Li
What about a hot standby namenode?

Using a write-ahead log to survive crashes and allow recovery is fine for
small I/O volumes.
For large volumes, the write-ahead log will take up a lot of the system's
I/O capacity, since it costs two I/Os per block (the log write and the
actual data write).  This falls back to how current database designs
implement crash recovery.

Another thing I don't see in the picture is how Hadoop manages file system
calls at the OS level.  Each operating system implements its file system
differently, and I believe that calling 'write' or 'flush' does not really
flush the data to disk.  I am not sure whether this is inevitable and
OS dependent, but I cannot find any documents describing how Hadoop
handles this.

P.S. I handle HA and fail-over in my own application, but I think that for
a framework it should be transparent (or semi-transparent) to the user.

-annndy
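
On the flush question: in plain Java, flush() only drains user-space
buffers, and durability on the local disk needs an explicit sync of the
file descriptor, which is why a write-ahead log costs the extra I/O
described above.  As far as I can tell, HDFS streams in 0.18 give even
weaker guarantees, since data is not durable until the block is closed.
A small java.io sketch of the distinction, using a hypothetical /tmp path:

import java.io.FileOutputStream;
import java.io.IOException;

public class FlushVsSync {
    public static void main(String[] args) throws IOException {
        FileOutputStream out = new FileOutputStream("/tmp/wal.log");
        out.write("record-1\n".getBytes());

        out.flush();         // drains JVM buffers; the OS page cache may still hold the data
        out.getFD().sync();  // asks the OS to force the data onto the physical device

        out.close();
    }
}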

On Fri, Feb 29, 2008 at 1:54 PM, Joydeep Sen Sarma [EMAIL PROTECTED]
wrote:

 I would agree with Ted.  You should easily be able to get 100MBps write
 throughput on a standard Netapp box (with read bandwidth left over,
 since the peak write throughput rating is more than twice that).  Even
 at an average write throughput rate of 50MBps, the daily data volume
 would be (drumroll ..) 4+TB!

 So buffer to a decent box and copy stuff over ..

 -Original Message-
 From: Ted Dunning [mailto:[EMAIL PROTECTED]
 Sent: Friday, February 29, 2008 11:33 AM
 To: core-user@hadoop.apache.org
 Subject: Re: long write operations and data recovery


 Unless your volume is MUCH higher than ours, I think you can get by with
 a
 relatively small farm of log consolidators that collect and concatenate
 files.

 If each log line is 100 bytes after compression (that is huge really)
 and
 you have 10,000 events per second (also pretty danged high) then you are
 only writing 1MB/s.  If you need a day of buffering (=100,000 seconds),
 then
 you need 100GB of buffer storage.  These are very, very moderate
 requirements for your ingestion point.
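
 For what it is worth, both back-of-the-envelope numbers check out; a tiny
 sketch of the arithmetic (illustrative values only):

 public class BackOfEnvelope {
     public static void main(String[] args) {
         // 100 bytes/event * 10,000 events/s = 1 MB/s ingest; a day of
         // buffering (~100,000 s) therefore needs about 100 GB.
         long bytesPerSecond = 100L * 10000;
         long bufferSeconds  = 100000;
         System.out.println(bytesPerSecond * bufferSeconds / 1000000000L + " GB");

         // 50 MB/s sustained for 86,400 s is a daily volume of 4+ TB.
         long dailyBytes = 50L * 1000000 * 86400;
         System.out.println(dailyBytes / 1000000000000L + " TB");
     }
 }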


 On 2/29/08 11:18 AM, Steve Sapovits [EMAIL PROTECTED] wrote:

  Ted Dunning wrote:
 
  In our case, we looked at the problem and decided that Hadoop wasn't
  feasible for our real-time needs in any case.  There were several
  issues,
 
  - first, of all, map-reduce itself didn't seem very plausible for
  real-time applications.  That left hbase and hdfs as the capabilities
  offered by hadoop (for real-time stuff)
 
  We'll be using map-reduce batch mode, so we're okay there.
 
  The upshot is that we use hadoop extensively for batch operations
  where it really shines.  The other nice effect is that we don't have
  to worry all that much about HA (at least not real-time HA) since we
  don't do real-time with hadoop.
 
  What I'm struggling with is the write side of things.  We'll have a
 huge
  amount of data to write that's essentially a log format.  It would
 seem
  that writing that outside of HDFS then trying to batch import it would
  be a losing battle -- that you would need the distributed nature of
 HDFS
  to do very large volume writes directly and wouldn't easily be able to
 take
  some other flat storage model and feed it in as a secondary step
 without
  having the HDFS side start to lag behind.
 
  The realization is that Name Node could go down so we'll have to have
 a
  backup store that might be used during temporary outages, but that
  most of the writes would be direct HDFS updates.
 
  The alternative would seem to be to end up with a set of distributed
 files
  without some unifying distributed file system (e.g., like lots of
 Apache
  web logs on many many individual boxes) and then have to come up with
  some way to funnel those back into HDFS.




Re: Questions regarding configuration parameters...

2008-02-21 Thread Andy Li
Try these two parameters to utilize all the cores per node/host.

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
  <description>The maximum number of map tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>7</value>
  <description>The maximum number of reduce tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

The default value is 2, so you might only see 2 cores used by Hadoop per
node/host.
If each machine has 4 cores (dual dual-core), then you can change
them to 3.

Hope this works for you.

-Andy
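
To make the relationship concrete: the *.tasks.maximum settings above are
per-tasktracker concurrency limits, so the cluster-wide slot capacity shown
on the JobTracker page scales with the node count, while mapred.map.tasks
and mapred.reduce.tasks are per-job hints.  A hedged sketch of the capacity
math for the 24-node, 4-way cluster described below, assuming the maximums
are set to 7:

public class SlotCapacity {
    public static void main(String[] args) {
        int nodes = 24;
        int mapSlotsPerNode = 7;     // mapred.tasktracker.map.tasks.maximum
        int reduceSlotsPerNode = 7;  // mapred.tasktracker.reduce.tasks.maximum

        // Cluster-wide capacity as reported by the JobTracker.
        System.out.println("map slots:    " + nodes * mapSlotsPerNode);     // 168
        System.out.println("reduce slots: " + nodes * reduceSlotsPerNode);  // 168

        // C G's choices: primes near the map capacity and near the node count.
        System.out.println("mapred.map.tasks    = 173 (prime near 7*24 = 168)");
        System.out.println("mapred.reduce.tasks = 23  (prime near 24)");
    }
}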


On Wed, Feb 20, 2008 at 9:30 AM, C G [EMAIL PROTECTED] wrote:

 Hi All:

  The documentation for the configuration parameters mapred.map.tasks and
 mapred.reduce.tasks discusses these values in terms of the number of available
 hosts in the grid.  This description strikes me as a bit odd given that a
 host could be anything from a uniprocessor to an N-way box, where N could
 vary from 2 to 16 or more.  The documentation is also vague about
 computing the actual value.  For example, for mapred.map.tasks the doc
 says "a prime number several times greater".  I'm curious about how people
 are interpreting the descriptions and what values people are using.
  Specifically, I'm wondering if I should be using core count instead of
 host count to set these values.

  In the specific case of my system, we have 24 hosts where each host is a
 4-way system (i.e. 96 cores total).  For mapred.map.tasks I chose the
 value 173, as that is a prime number which is near 7*24.  For
 mapred.reduce.tasks I chose 23 since that is a prime number close to 24.
  Is this what was intended?

  Beyond curiosity, I'm concerned about setting these values and other
 configuration parameters correctly because I am pursuing some performance
 issues where it is taking a very long time to process small amounts of data.
  I am hoping that some amount of tuning will resolve the problems.

  Any thoughts and insights most appreciated.

  Thanks,
   C G


