Exactly. I'm no expert on maven either, but I like its convenience for
classpath handling.
Attached are my scripts.
- Hadoop-installer allows me to install different versions of hadoop
to the local repo
- Pom has an assembly plugin (change mainClass and packageName to be
your target)
- Assembly does the packaging. Run it with:
    mvn assembly:assembly -Dmaven.test.skip=true
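As a rough illustration of the assembly plugin entry mentioned above, a pom section along these lines would wire a descriptor into the build and set the main class in the jar manifest. This is only a sketch: the descriptor path src/main/assembly/assembly.xml and the class name com.example.MyJob are placeholders, not values from the attached pom.

```xml
<build>
  <plugins>
    <plugin>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <descriptors>
          <!-- Placeholder path: point this at wherever the assembly descriptor lives -->
          <descriptor>src/main/assembly/assembly.xml</descriptor>
        </descriptors>
        <archive>
          <manifest>
            <!-- Placeholder class: replace with the job's real main class -->
            <mainClass>com.example.MyJob</mainClass>
          </manifest>
        </archive>
      </configuration>
    </plugin>
  </plugins>
</build>
```

With an entry like this in place, the mvn assembly:assembly command shown above should produce a jar under target/ whose name ends in the descriptor's id (jar-with-dependencies in the attached descriptor).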
The way I work is to manage all dependencies in the pom and use mvn
eclipse:eclipse to keep the Eclipse build path correct. Then I just run
everything in Eclipse with small input files until I am happy that it
works. Then I build the jar with dependencies and copy it up to EC2
to run on the cluster. It might not be the best way, but it seems fairly
efficient for me.
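To make "manage all dependencies in the pom" concrete, a dependencies section could look roughly like the sketch below. The hadoop-core coordinates match what the attached installer script puts into the local repo when run with -Dhadoop.version=0.19.0. The second dependency (com.example:some-library) is purely hypothetical and stands in for whatever third-party libraries a job needs.

```xml
<dependencies>
  <!-- Same groupId/artifactId the installer script installs locally -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>0.19.0</version>
  </dependency>
  <!-- Hypothetical third-party library, for illustration only -->
  <dependency>
    <groupId>com.example</groupId>
    <artifactId>some-library</artifactId>
    <version>1.0</version>
  </dependency>
</dependencies>
```

Because the attached assembly descriptor excludes hadoop-core and log4j, only the remaining runtime dependencies would be unpacked into the final jar.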
Cheers,
Tim
On Wed, Dec 3, 2008 at 10:42 PM, Scott Whitecross [EMAIL PROTECTED] wrote:
Thanks Tim.
We use Maven, though I'm not an expert on it. Basically you are using Maven
to take the dependencies, and package them in one large jar? Basically
unjar the contents of the jar and use those with your code I'm assuming?
On Dec 3, 2008, at 9:25 AM, tim robertson wrote:
Can't answer your question exactly, but can let you know what I do.
I build all dependencies into 1 jar, and by using Maven for my build
environment, when I assemble my jar, I am 100% sure all my
dependencies are collected together. This is working very nicely for
me and I have used the same scripts for around 20 different jars that
I run on EC2 - each had different dependencies which would have been a
pain to manage separately, but maven simplifies this massively.
Let me know if you want any of my maven config for assembly etc if you
are a maven user...
Cheers,
Tim
On Wed, Dec 3, 2008 at 3:19 PM, Scott Whitecross [EMAIL PROTECTED] wrote:
What's the best way to use third party libraries with Hadoop? For example,
I want to run a job with both a jar file containing the job, and also extra
libraries. I noticed a couple of solutions with a search, but I'm hoping for
something better:
- Merge the third party jar libraries into the job jar
- Distribute the third party libraries across the cluster in the local
  boxes' classpath.
What I'd really like is a way to add an extra option to the hadoop jar
command, such as hadoop/bin/hadoop jar myJar.jar myJobClass -classpath
thirdpartyjar1.jar:jar2.jar:etc args
Anything exist like this?

<assembly>
  <id>jar-with-dependencies</id>
  <formats>
    <format>jar</format>
  </formats>
  <includeBaseDirectory>false</includeBaseDirectory>
  <fileSets>
    <fileSet>
      <directory>target/classes</directory>
      <outputDirectory>/</outputDirectory>
    </fileSet>
  </fileSets>
  <dependencySets>
    <dependencySet>
      <outputDirectory>/</outputDirectory>
      <unpack>true</unpack>
      <scope>runtime</scope>
      <excludes>
        <!-- Do not include hadoop or log4j -->
        <exclude>org.apache.hadoop:hadoop-core</exclude>
        <exclude>log4j:log4j</exclude>
      </excludes>
    </dependencySet>
  </dependencySets>
</assembly>

<!--
A little script to install hadoop. It requires that hadoop-XXX-core.jar is in the same directory and run as:
mvn -f hadoop-installer.xml install -Dhadoop.version=0.19.0 -Dmaven.test.skip=true
-->
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop</artifactId>
  <name>Hadoop (${hadoop.version})</name>
  <packaging>jar</packaging>
  <version>${hadoop.version}</version>
  <build>
    <plugins>
      <plugin>
        <artifactId>maven-install-plugin</artifactId>
        <executions>
          <execution>
            <id>install-hadoop</id>
            <phase>install</phase>
            <goals>
              <goal>install-file</goal>
            </goals>
            <configuration>
              <file>hadoop-${hadoop.version}-core.jar</file>
              <groupId>org.apache.hadoop</groupId>
              <artifactId>hadoop-core</artifactId>
              <packaging>jar</packaging>
              <version>${hadoop.version}</version>
              <generatePom>true</generatePom>
              <createChecksum>true</createChecksum>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>

<project xmlns="http://maven.apache.org/POM/4.0.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.ibiodiversity.index</groupId>
  <artifactId>index-hadoop</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>hadoop-server</name>
  <url>http://maven.apache.org</url>
  <repositories>
    <repository>
      <id>geotools</id>
      <url>http://maven.geotools.fr/repository</url>
    </repository>
    <repository>
      <id>central</id>
      <url>http://repo1.maven.org/maven2</url>
    </repository>
    <repository>
      <id>appfuse</id>
      <url>http://static.appfuse.org/repository</url>
    </repository>