Waiting for others to give a best practice. I think you can use Eclipse to manage the Maven project and see the full dependency hierarchy; if some jar (for example, guava) exists in both the Hadoop dependency chain and your own requirements, set your requirement's scope to "provided".
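As a minimal sketch of what that scope change might look like in the pom (the version number here is illustrative):

```xml
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>11.0.2</version>
  <!-- "provided" keeps the jar on the compile classpath but excludes it
       from the assembled jar, so the cluster's own copy is used at runtime -->
  <scope>provided</scope>
</dependency>
```

Note this only avoids shipping a duplicate copy; it does not help if your code actually needs a newer API than the version Hadoop ships.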
Regards,
*Stanley Shi*

On Mon, Mar 10, 2014 at 11:33 AM, Fengyun RAO <raofeng...@gmail.com> wrote:

> First of all, I want to note that I use CDH5 beta, manage the project
> with Maven, and have googled and read a lot, e.g.
> https://issues.apache.org/jira/browse/MAPREDUCE-1700
> http://www.datasalt.com/2011/05/handling-dependencies-and-configuration-in-java-hadoop-projects-efficiently/
>
> I believe the problem is quite common: when we write an MR job, we need
> lots of dependencies, which may not exist in, or may conflict with, the
> HADOOP_CLASSPATH. There are several options, e.g.
>
> 1. Add all libraries to my own jar and set
> HADOOP_USER_CLASSPATH_FIRST=true.
> This is what I do, which makes the jar very big, and still it doesn't
> work: e.g. I already packaged guava-16.0.jar in my jar, but it still uses
> guava-11.0.2.jar from the HADOOP_CLASSPATH.
> Below is my build configuration:
>
> <plugin>
>   <artifactId>maven-assembly-plugin</artifactId>
>   <configuration>
>     <archive>
>       <manifest>
>         <mainClass>xxx.xxx.xxx.Runner</mainClass>
>       </manifest>
>     </archive>
>     <descriptorRefs>
>       <descriptorRef>jar-with-dependencies</descriptorRef>
>     </descriptorRefs>
>   </configuration>
>   <executions>
>     <execution>
>       <id>make-assembly</id>
>       <phase>package</phase>
>       <goals>
>         <goal>single</goal>
>       </goals>
>     </execution>
>   </executions>
> </plugin>
>
> 2. Distinguish which libraries are not present in the HADOOP_CLASSPATH
> and put them into the DistributedCache.
> I think it's hard to distinguish, and still, if a jar conflicts, which
> dependency would take precedence?
>
> *What's the best practice, especially using Maven?*
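For the specific guava-16 vs. guava-11.0.2 clash above, another option worth mentioning is package relocation with the maven-shade-plugin: it rewrites the bytecode references in your jar so your code calls a renamed copy of Guava while Hadoop keeps loading its own. A sketch, replacing the assembly plugin; the plugin version and the shaded package prefix are illustrative:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.2</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <!-- Move Guava's classes (and your references to them) under a
               private prefix so they never collide with Hadoop's copy -->
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>myjob.shaded.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

With relocation in place, HADOOP_USER_CLASSPATH_FIRST no longer matters for the relocated packages, since the two versions live under different class names.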