Controlling maps per Node on 0.19.0 working?

2008-11-30 Thread tim robertson
Hi,

I am a newbie so please excuse if I am doing something wrong:

in hadoop-site.xml I have the following since I have a very memory
intensive map:
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
</property>

But on my EC2 cluster of 1 node (for testing) it always starts 3 tasks
on the node and then dies with memory problems...

Am I missing something?

Thanks,

Tim


Re: Newbie: Problem splitting a tab file into many (20,000) files

2008-11-30 Thread tim robertson
Thanks Brian - I really appreciate the help and insights!

I have now moved to 0.19.0 as you suggested.  I was indeed running heavy
Map tasks but was running only 1 per node.

I am attempting to cross reference 150M records, each with a geospatial
point (latitude and longitude), against a separate list of 120,000
polygons...  PostGIS is performing really badly with this, and our
150 million looks set to increase to billions, so I am looking for a
scalable process - hence Hadoop.

Since you recommend not splitting into many files (previously I was
splitting the 150M into lots of smaller files based on area and then
only using those as inputs to the polygon cross reference, to reduce the
numbers), I am now able to keep the polygons in memory with a
geospatial index and just run over the full input file of point
records in one pass.  I am struggling to make it run only 1 map task
(one JVM) per node in 0.19.0 though, despite having:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
</property>

Am I missing some configuration to enforce this?

Thanks all for any advice - I am new to hadoop...

Tim




On Fri, Nov 28, 2008 at 6:10 PM, Brian Bockelman [EMAIL PROTECTED] wrote:
 Hey Tim,

 1) You have Ganglia enabled, and it's unfortunately broken in that release.
  :(  Turn it off, and you'll have much better luck.
 2) The "failed to report status for 603 seconds" error indicates to me that you
 might be overloading an individual datanode.  Look at the Ganglia interface
 for the node and see if it is perhaps way overloaded or swapping.  I see you
 are using 1.4GB per map task: each small instance has 1.7GB.  Are you
 running multiple map tasks per node and going into swap?
 3) Similarly, are you overloading your namenode through memory exhaustion?
  I'm no expert, but I'd speculate some of those messages could be causing
 this.

 It seems that you have a memory-intensive map.  Could I recommend upgrading
 to 0.19.0 to take advantage of the improved JVM usage?  I believe this would
 reduce the memory usage of your application greatly.
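
 For reference, a minimal sketch of the 0.19 JVM-reuse setting (HADOOP-249);
 the driver class name here is a placeholder, not something from this thread:

 import org.apache.hadoop.mapred.JobConf;

 public class JvmReuseExample {
   public static JobConf configure() {
     JobConf conf = new JobConf(JvmReuseExample.class);
     // -1 = reuse one JVM for any number of this job's tasks on a node;
     // the same knob as mapred.job.reuse.jvm.num.tasks in the job config.
     conf.setNumTasksToExecutePerJvm(-1);
     return conf;
   }
 }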

 Additionally, can I recommend that you re-visit the requirement to have 20k
 files output per job?  Remember that Hadoop is heavily optimized for
 medium-to-large size files.  Is it not an option to use SequenceFile to
 combine those into a much smaller number of multi-hundred-MB files?  You can
 probably make your setup work eventually, but it'll be a bit like fighting
 the tide.  Alternately, if you must have random-record access, try putting
 your results into HBase.
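
 A rough sketch of the SequenceFile approach (the output path and the
 key/value types are invented for illustration):

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.SequenceFile;
 import org.apache.hadoop.io.Text;

 public class CombineToSequenceFile {
   public static void main(String[] args) throws Exception {
     Configuration conf = new Configuration();
     FileSystem fs = FileSystem.get(conf);
     // One large SequenceFile of (cellId, record) pairs instead of 20,000 small files.
     Path out = new Path("/user/tim/cells.seq");   // hypothetical output path
     SequenceFile.Writer writer =
         SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
     try {
       writer.append(new Text("cell-123"), new Text("record data..."));  // one call per record
     } finally {
       writer.close();
     }
   }
 }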

 Hope this helps!

 Brian

 On Nov 28, 2008, at 2:14 AM, tim robertson wrote:

 I guess no one is doing this kind of thing on hadoop?

 Here is my latest error (run on 20 EC2 small nodes, 200 maps, 20
 reducers, -Xmx1400M).

 Any help, ideas for trying, greatly appreciated!

 Cheers,

 Tim

 ...
 08/11/27 12:31:23 INFO mapred.JobClient:  map 100% reduce 73%
 08/11/27 12:31:36 INFO mapred.JobClient:  map 100% reduce 74%
 08/11/27 12:32:09 INFO mapred.JobClient:  map 100% reduce 75%
 08/11/27 12:41:50 INFO mapred.JobClient:  map 100% reduce 71%
 08/11/27 12:41:50 INFO mapred.JobClient: Task Id :
 attempt_200811271043_0008_r_06_0, Status : FAILED
 java.io.IOException: Could not get block locations. Aborting...
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2143)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)

 Task attempt_200811271043_0008_r_06_0 failed to report status for
 603 seconds. Killing!
 attempt_200811271043_0008_r_06_0: Exception in thread "Timer thread for monitoring mapred" java.lang.NullPointerException
 attempt_200811271043_0008_r_06_0:   at org.apache.hadoop.metrics.ganglia.GangliaContext.xdr_string(GangliaContext.java:195)
 attempt_200811271043_0008_r_06_0:   at org.apache.hadoop.metrics.ganglia.GangliaContext.emitMetric(GangliaContext.java:138)
 attempt_200811271043_0008_r_06_0:   at org.apache.hadoop.metrics.ganglia.GangliaContext.emitRecord(GangliaContext.java:123)
 attempt_200811271043_0008_r_06_0:   at org.apache.hadoop.metrics.spi.AbstractMetricsContext.emitRecords(AbstractMetricsContext.java:304)
 attempt_200811271043_0008_r_06_0:   at org.apache.hadoop.metrics.spi.AbstractMetricsContext.timerEvent(AbstractMetricsContext.java:290)
 attempt_200811271043_0008_r_06_0:   at org.apache.hadoop.metrics.spi.AbstractMetricsContext.access$000(AbstractMetricsContext.java:50)
 attempt_200811271043_0008_r_06_0:   at org.apache.hadoop.metrics.spi.AbstractMetricsContext$1.run(AbstractMetricsContext.java:249)
 attempt_200811271043_0008_r_06_0:   at java.util.TimerThread.mainLoop(Timer.java:512)
 attempt_200811271043_0008_r_06_0:   at java.util.TimerThread.run(Timer.java:462)
 

Re: Controlling maps per Node on 0.19.0 working?

2008-11-30 Thread tim robertson
Ok - apologies, it seems changes to the hadoop-site.xml are not
automatically picked up after the cluster is running.
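
For the archives: the mapred.tasktracker.*.tasks.maximum values are
tasktracker daemon settings, read from hadoop-site.xml when the TaskTracker
starts, so they only take effect after the daemons are restarted. A hedged
illustration of the job-level vs daemon-level distinction (the class name is
a placeholder):

import org.apache.hadoop.mapred.JobConf;

public class ConfScopeExample {
  public static JobConf configure() {
    JobConf conf = new JobConf(ConfScopeExample.class);
    conf.setNumReduceTasks(20);  // job-level: honoured when the job is submitted
    // Daemon-level: setting this in the job conf has no effect; the TaskTracker
    // reads it from hadoop-site.xml at startup.
    conf.set("mapred.tasktracker.map.tasks.maximum", "1");
    return conf;
  }
}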

Cheers

Tim

On Sun, Nov 30, 2008 at 12:48 PM, tim robertson
[EMAIL PROTECTED] wrote:
 Hi,

 I am a newbie so please excuse if I am doing something wrong:

 in hadoop-site.xml I have the following since I have a very memory
 intensive map:
 <property>
   <name>mapred.tasktracker.map.tasks.maximum</name>
   <value>1</value>
 </property>

 <property>
   <name>mapred.tasktracker.reduce.tasks.maximum</name>
   <value>1</value>
 </property>

 But on my EC2 cluster of 1 node (for testing) it always starts 3 tasks
 on the node and then dies with memory problems...

 Am I missing something?

 Thanks,

 Tim



Re: Lookup HashMap available within the Map

2008-11-30 Thread Shane Butler
Given the goal of shared data accessible across the Map instances,
can someone please explain some of the differences between using:
- setNumTasksToExecutePerJvm() and then having statically declared
data initialised in Mapper.configure(); and
- a MultithreadedMapRunner?

Regards,
Shane


On Wed, Nov 26, 2008 at 6:41 AM, Doug Cutting [EMAIL PROTECTED] wrote:
 tim robertson wrote:

 Thanks Alex - this will allow me to share the shapefile, but I need to
 read it, parse it and store the objects in the index only once per job
 per JVM.
 Is the Mapper.configure() the best place to do this?  E.g. will it
 only be called once per job?

 In 0.19, with HADOOP-249, all tasks from a job can be run in a single JVM.
  So, yes, you could access a static cache from Mapper.configure().

 Doug




For those using Hadoop in the social network domain

2008-11-30 Thread Pete Wyckoff



  SOCIAL NETWORK SYSTEMS 2009 (SNS-2009)
  =
  Second ACM Workshop on Social Network Systems
 March 31, EuroSys 2009
   Nuremberg, Germany
   http://www.eecs.harvard.edu/~stein/SocialNets-2009/

OVERVIEW

The Second Workshop on Social Network Systems (SNS'09) will gather
researchers to discuss novel ideas about computer systems and social
networks.

Online social networks are among the most popular sites on the Web and
continue to grow rapidly. They provide mechanisms to establish identities,
share information, and create relationships. The resulting social graph
provides a basis for communicating, distributing, and locating content.

Broadly, the systems issues of social networks include:

* How can systems infrastructure be improved for social networks?
Infrastructure includes database systems, operating systems, file systems,
and storage systems.

* How can the social graph be leveraged in computer system design? The
social graph encodes trust and common interests. How and to what extent can
this encoding be used to improve computer systems?

* How can social networks be modeled and characterized? What has been
learned from the operation of existing systems?

Topics of interest include, but are not limited to:

* Security and privacy.
* Leveraging the social graph in systems design.
* Real-time monitoring and query processing.
* Database issues for offline analysis.
* Experiences with deployed systems.
* Crawlers and other mechanisms for observing social network structure.
* Measurement and analysis, including comparative analysis.
* Tools for designing and deploying social networks.
* Network dynamics, relationships between network links and user behavior.
* Benchmarks, modeling, and characterization.
* Decentralization: methods for integrating multiple networks.
* Application programming interfaces (APIs) for social networks.

The papers presented, as well as a summary of the discussion, will be
archived electronically. Accepted papers may be subsequently revised,
expanded, and submitted to full conferences and journals.

ORGANIZERS:

Chair:
Lex Stein, Facebook

Program Committee:

Samuel Bernard, LIP6
Meeyoung Cha, MPI-SWS
Wei Chen, Microsoft Research Asia
Yafei Dai, Peking University
Adrienne Felt, UC Berkeley
Eran Gabber, Google
Bingsheng He, Microsoft Research Asia
Anne-Marie Kermarrec, INRIA
Peter Key, Microsoft Research Cambridge
Chris Lesniewski-Laas, MIT
Shiding Lin, Baidu
Alan Mislove, MPI-SWS and Rice University
Yoann Padioleau, UIUC
Peter Pietzuch, Imperial College London
Stefan Saroiu, Microsoft Research Redmond
Rodrigo Schmidt, Facebook
Jacky Shen, Microsoft Research Asia
Steven Smaldone, Rutgers
Lex Stein, Facebook
Jacob Strauss, MIT
Nguyen Tran, NYU
Edward Wang, Google
David Wei, Facebook
Geoffrey Werner-Allen, Harvard
Eiko Yoneki, University of Cambridge

IMPORTANT DATES

Paper submissions due:    February 2, 2009
Notification to authors:  February 16, 2009
Workshop: March 31, 2009

SUBMITTING A PAPER

Papers must be received by 23:59 GMT, on January 26, 2009. This is a hard
deadline. Submissions should contain six or fewer two-column pages,
including all figures and references, using 10-point fonts, standard
spacing, and 1-inch margins (we recommend the ACM sig-alternate template; a
LaTeX template is available at
http://www.eecs.harvard.edu/~stein/sig-alternate-10pt.cls). Please number
pages. All submissions will be electronic, and must be in either PDF format
(preferred) or PostScript. Author names and affiliations should appear on
the title page. Reviewing will be single-blind.

This workshop is sponsored by ACM, ACM SigOps, and EuroSys.






RE: Did Hadoop support gz/zip format file?

2008-11-30 Thread zhuweimin
Hello

I have a requirement to use files with the extension .Z (created by the UNIX
compress command) in Hadoop.
Will they also be automatically recognized / handled?

Any suggestions on how to handle the .Z files in the map task?

Thanks
Best Regards

Alamo

-Original Message-
From: Abdul Qadeer [mailto:[EMAIL PROTECTED] 
Sent: Monday, November 03, 2008 4:19 PM
To: core-user@hadoop.apache.org
Subject: Re: Did Hadoop support gz/zip format file?

Hadoop supports the gzip format by means of its gzip codec.
If you use the default input format / record reader of
Hadoop, it will automatically recognize / handle your gzip
input files.  And if you are using some other input format / record
reader, you can use the gzip codec provided by Hadoop to
decompress the data and then interpret it in your record reader.
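
A rough sketch of using the codec machinery inside a custom record reader
(the wrapper class here is made up; the compression classes are from
org.apache.hadoop.io.compress):

import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CompressionAwareOpen {
  // Open a possibly-compressed file, decompressing transparently when a
  // codec matching the file extension (e.g. .gz) is registered.
  public static InputStream open(Configuration conf, Path file) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    InputStream in = fs.open(file);
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(file);
    return (codec == null) ? in : codec.createInputStream(in);
  }
}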

On Sun, Nov 2, 2008 at 7:49 PM, howardh [EMAIL PROTECTED] wrote:

 Hello,

 I have a requirement to use the gz/zip format in Hadoop.  After some days of
 research and learning, it seems Hadoop doesn't support gz/zip files yet -
 is that true?
 I am going to create a file in gz format and read it later through the
 FileSystem interface.  Is that feasible?  Experts, could you give me
 some advice?

 Best Regards
 2008-11-03



 howardh





RE: Hadoop Internal Architecture writeup

2008-11-30 Thread Amar Kamat
Hey, nice work and nice writeup. Keep it up.
Comments inline.
Amar


-Original Message-
From: Ricky Ho [mailto:[EMAIL PROTECTED]
Sent: Fri 11/28/2008 9:45 AM
To: core-user@hadoop.apache.org
Subject: RE: Hadoop Internal Architecture writeup
 
Amar, thanks a lot.  This is exactly the kind of feedback that I am looking for
...  I have some more questions ...

==
The jobclient while submitting the job calculates the split using
InputFormat which is specified by the user. Internally the InputFormat
might make use of dfs-block size, user-hinted num-maps etc. The
jobtracker is given 3 files
- job.xml : job control parameters
- job.split : the split file
- job.jar : user map-reduce code
==
[Ricky]  What exactly does the job.split file contain?  I assume it contains the
specification for each split (but not its data), such as the corresponding
file and the byte range within that file.  Correct?
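
A hedged illustration of what a map task sees for FileInputFormat-based jobs
(the split carries only metadata - file, offset, length - not the data itself;
the mapper class is invented for the example):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SplitInspectingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    FileSplit split = (FileSplit) reporter.getInputSplit();
    Path file = split.getPath();      // which file this task reads
    long start = split.getStart();    // starting byte offset within that file
    long length = split.getLength();  // number of bytes assigned to this split
    out.collect(new Text(file.toString()), new Text(start + "+" + length));
  }
}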


This process is interleaved/parallelized. As soon as a map is done, the
JobTracker is notified. Once a tracker (with a reducer) asks for events,
these new events are passed. Hence the map output pulling (Shuffle
Phase) works in parallel with the Map Phase. Reduce Phase can start only
once all the (resp) map outputs are copied and merged.
=
[Ricky]  I am curious why the reduce execution can't start earlier
(before all the map tasks have completed).  The value iterator inside the
user-defined reduce() method could block to wait for more map tasks to
complete.  In other words, map() and reduce() could also proceed with
pipeline parallelism.


==
There is a 1-1 mapping between a split and a map task. Hence it will
re-run the map on the corresponding split.
==
[Ricky]  Do you mean if the job has 5000 splits, then it requires 5000 
TaskTrackers VM (one for each split) ?

comment:
If the job has 5000 splits, then it requires 5000 VMs (one for each split).
TaskTracker is a framework daemon: a process (JVM) that handles/manages tasks
(the processes processing a split) on a node. A TaskTracker is recognized by
(node-hostname + port). A task is never executed inside the TaskTracker
itself; a new JVM is spawned for it. The reason is that faulty user code
(map/reduce) should not bring down the TaskTracker (a framework process).
But with hadoop-0.19 we have JVM reuse, and hence 5000 splits might require
fewer than 5000 VMs. Note that tasks might also be speculatively executed at
the end, which can add to the VM count.
Amar


===
The client is unblocked once the job is submitted. The way it works is
as follows :
- jobclient requests the jobtracker for a unique job id
- jobclient does some sanity checks to see if the output folder exists
etc ...
- jobclient uploads job files (xml, jar, split) onto a known location
called System-Directory

[Ricky]  Is this a well-know folder within the HDFS ?

This is set using mapred.system.dir during cluster startup (see
hadoop-default.xml). It's a framework directory.
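
A minimal driver illustrating the submission path described above (the job
name, paths and driver class are invented for the example):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SubmitExample.class);
    conf.setJobName("submit-example");
    conf.setOutputKeyClass(LongWritable.class);  // identity mapper/reducer defaults
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path("/user/example/in"));    // hypothetical
    FileOutputFormat.setOutputPath(conf, new Path("/user/example/out"));  // hypothetical
    // runJob() performs the steps above: it gets a job id from the JobTracker,
    // checks the output folder, uploads job.xml / job.split / job.jar to
    // mapred.system.dir, submits the job and then polls until it completes.
    JobClient.runJob(conf);
  }
}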




Re: Lookup HashMap available within the Map

2008-11-30 Thread tim robertson
Hi Shane,

I can't explain that, but I can say that with 0.19.0 I am now
successfully using setNumTasksToExecutePerJvm(-1) and then initializing
statically declared data in the Mapper's configure().  It really is
educated guesswork for the tuning parameters though - I am profiling
the app for memory usage locally and then determining by trial and error
how much extra I need for the node's Hadoop framework activities, in
order to set the -Xmx params and map tasks per node for the different
EC2 sizes.  A little dirty perhaps, but I am still learning
(http://biodivertido.blogspot.com/2008/11/reproducing-spatial-joins-using-hadoop.html).
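
For completeness, a stripped-down sketch of the pattern - with a crude
bounding-box stand-in for the real geospatial index, and the job property
"polygon.file" invented for illustration:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class PointInPolygonMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // Shared by every map task that reuses this JVM (setNumTasksToExecutePerJvm(-1)).
  // A polygon is represented here by its bounding box [minLat, minLng, maxLat, maxLng];
  // a real job would build a proper geospatial index instead.
  private static Map<String, double[]> polygons;

  public void configure(JobConf job) {
    synchronized (PointInPolygonMapper.class) {
      if (polygons == null) {
        polygons = loadPolygons(job.get("polygon.file"));  // parse once per JVM
      }
    }
  }

  // Placeholder for reading and parsing the polygon data into memory.
  private static Map<String, double[]> loadPolygons(String path) {
    return new HashMap<String, double[]>();
  }

  public void map(LongWritable key, Text line,
                  OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    // Assumed input layout: recordId <tab> latitude <tab> longitude ...
    String[] parts = line.toString().split("\t");
    double lat = Double.parseDouble(parts[1]);
    double lng = Double.parseDouble(parts[2]);
    for (Map.Entry<String, double[]> e : polygons.entrySet()) {
      double[] b = e.getValue();
      if (lat >= b[0] && lng >= b[1] && lat <= b[2] && lng <= b[3]) {
        out.collect(new Text(e.getKey()), new Text(parts[0]));  // polygonId -> recordId
      }
    }
  }
}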

I'm interested to know when one would use a MultithreadedMapRunner also.

Cheers

Tim

On Sun, Nov 30, 2008 at 11:22 PM, Shane Butler [EMAIL PROTECTED] wrote:
 Given the goal of shared data accessible across the Map instances,
 can someone please explain some of the differences between using:
 - setNumTasksToExecutePerJvm() and then having statically declared
 data initialised in Mapper.configure(); and
 - a MultithreadedMapRunner?

 Regards,
 Shane


 On Wed, Nov 26, 2008 at 6:41 AM, Doug Cutting [EMAIL PROTECTED] wrote:
 tim robertson wrote:

 Thanks Alex - this will allow me to share the shapefile, but I need to
 read it, parse it and store the objects in the index only once per job
 per JVM.
 Is the Mapper.configure() the best place to do this?  E.g. will it
 only be called once per job?

 In 0.19, with HADOOP-249, all tasks from a job can be run in a single JVM.
  So, yes, you could access a static cache from Mapper.configure().

 Doug