Best practice for using third party libraries in MapReduce Jobs?

2008-12-03 Thread Scott Whitecross
What's the best way to use third party libraries with Hadoop?  For
example, I want to run a job with both a jar file containing the job,
and also extra libraries.  I noticed a couple of solutions with a search,
but I'm hoping for something better:


- Merge the third party jar libraries into the job jar
- Distribute the third party libraries across the cluster, on each local box's classpath.


What I'd really like is a way to add an extra option to the hadoop jar  
command, such as hadoop/bin/hadoop jar myJar.jar myJobClass -classpath  
thirdpartyjar1.jar:jar2.jar:etc  args


Does anything like this exist?


Re: Best practice for using third party libraries in MapReduce Jobs?

2008-12-03 Thread tim robertson
Can't answer your question exactly, but can let you know what I do.

I build all dependencies into one jar, and because I use Maven for my build
environment, when I assemble my jar I am 100% sure all my
dependencies are collected together.  This is working very nicely for
me, and I have used the same scripts for around 20 different jars that
I run on EC2 - each had different dependencies, which would have been a
pain to manage separately, but Maven simplifies this massively.

If you are a Maven user, let me know if you want any of my Maven config
for the assembly etc...

Cheers,

Tim





Re: Best practice for using third party libraries in MapReduce Jobs?

2008-12-03 Thread Johannes Zillmann
You could use the DistributedCache to put multiple jars onto the
classpath.  Of course, you would have to write your own job-submission
logic for that.
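
A rough sketch of what that submission logic could look like (the jar
name, the HDFS path, and the class name below are placeholders, not
something from this thread):

  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class SubmitWithExtraJars {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(SubmitWithExtraJars.class);

      // Copy the third-party jar to HDFS so the task trackers can fetch it.
      FileSystem fs = FileSystem.get(conf);
      Path hdfsJar = new Path("/tmp/libs/thirdpartyjar1.jar");
      fs.copyFromLocalFile(new Path("thirdpartyjar1.jar"), hdfsJar);

      // Add the cached jar to the classpath of every map and reduce task.
      DistributedCache.addFileToClassPath(hdfsJar, conf);

      // ... set mapper, reducer, input and output paths as usual ...
      JobClient.runJob(conf);
    }
  }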


Johannes




~~~
101tec GmbH
Halle (Saale), Saxony-Anhalt, Germany
http://www.101tec.com



Re: Best practice for using third party libraries in MapReduce Jobs?

2008-12-03 Thread tim robertson
Exactly.  I'm no expert on Maven either, but I like its convenience for
classpath handling.

Attached are my scripts.
- hadoop-installer.xml lets me install different versions of Hadoop
into the local repo
- The pom has an assembly plugin (change mainClass and packageName to be
your target)
- The assembly descriptor does the packaging.  Run it with:
  mvn assembly:assembly -Dmaven.test.skip=true

The way I work: I manage all dependencies in the pom and use mvn
eclipse:eclipse to keep the Eclipse build path correct.  Then I just run
everything in Eclipse with small input files until I am happy that it
works.  Then I build the jar with dependencies and copy it up to EC2
to run on the cluster.  It might not be the best way, but it seems fairly
efficient for me.
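
For example, with the attached pom the assembled jar should end up under
target/ as something like index-hadoop-1.0-SNAPSHOT-jar-with-dependencies.jar
(the usual artifactId-version-assemblyId naming), and assuming the mainClass
from the pom ends up in the jar manifest it runs like any other job jar:

  hadoop/bin/hadoop jar index-hadoop-1.0-SNAPSHOT-jar-with-dependencies.jar <args>

No extra classpath handling is needed because the dependencies are already
unpacked inside the jar.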

Cheers,

Tim


On Wed, Dec 3, 2008 at 10:42 PM, Scott Whitecross [EMAIL PROTECTED] wrote:
 Thanks Tim.

 We use Maven, though I'm not an expert on it.  Basically you are using Maven
 to take the dependencies and package them in one large jar?  Unjarring the
 contents of the dependency jars and bundling those classes with your own code, I'm assuming?






<assembly>
  <id>jar-with-dependencies</id>
  <formats>
    <format>jar</format>
  </formats>
  <includeBaseDirectory>false</includeBaseDirectory>
  <fileSets>
    <fileSet>
      <directory>target/classes</directory>
      <outputDirectory>/</outputDirectory>
    </fileSet>
  </fileSets>
  <dependencySets>
    <dependencySet>
      <outputDirectory>/</outputDirectory>
      <unpack>true</unpack>
      <scope>runtime</scope>
      <excludes>
        <!-- Do not include hadoop or log4j -->
        <exclude>org.apache.hadoop:hadoop-core</exclude>
        <exclude>log4j:log4j</exclude>
      </excludes>
    </dependencySet>
  </dependencySets>
</assembly>

<!--
  A little script to install hadoop.  It requires that hadoop-XXX-core.jar is in the same directory and run as:
  mvn -f hadoop-installer.xml install -Dhadoop.version=0.19.0 -Dmaven.test.skip=true
-->
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop</artifactId>
  <name>Hadoop (${hadoop.version})</name>
  <packaging>jar</packaging>
  <version>${hadoop.version}</version>
  <build>
    <plugins>
      <plugin>
        <artifactId>maven-install-plugin</artifactId>
        <executions>
          <execution>
            <id>install-hadoop</id>
            <phase>install</phase>
            <goals>
              <goal>install-file</goal>
            </goals>
            <configuration>
              <file>hadoop-${hadoop.version}-core.jar</file>
              <groupId>org.apache.hadoop</groupId>
              <artifactId>hadoop-core</artifactId>
              <packaging>jar</packaging>
              <version>${hadoop.version}</version>
              <generatePom>true</generatePom>
              <createChecksum>true</createChecksum>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>

<project xmlns="http://maven.apache.org/POM/4.0.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.ibiodiversity.index</groupId>
  <artifactId>index-hadoop</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>hadoop-server</name>
  <url>http://maven.apache.org</url>
  <repositories>
    <repository>
      <id>geotools</id>
      <url>http://maven.geotools.fr/repository</url>
    </repository>
    <repository>
      <id>central</id>
      <url>http://repo1.maven.org/maven2</url>
    </repository>
    <repository>
      <id>appfuse</id>
      <url>http://static.appfuse.org/repository</url>
    </repository>