[ 
https://issues.apache.org/jira/browse/HBASE-20332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Busbey updated HBASE-20332:
--------------------------------
    Status: Patch Available  (was: In Progress)

Okay, I fell down a bit of a rabbit hole with this one, so here's what I have 
so far. I'm testing this on a cluster now, so I might end up with more changes. 
Feedback on the direction of the approach so far would be welcome.

Note that I've stumbled onto what seems to be a bug in the maven-shade-plugin: 
activation clauses for profiles are stripped in the dependency-reduced-pom. 
That means that rather than defaulting to the hadoop 2 profile's provided 
dependencies, our shaded mapreduce artifact by default shows none of the 
provided scope hadoop dependencies.
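
One rough way to confirm the stripped activation clauses is to count 
{{<activation>}} elements in the installed pom. This is only a sketch: the 
function name is made up for illustration, and it assumes each opening 
{{<activation>}} tag sits on its own line, as it normally does in a generated 
pom.

```shell
# Illustrative helper (hypothetical name): print how many lines of the given
# pom file contain an opening <activation> tag.
count_activation_clauses() {
  # grep -c prints 0 and exits non-zero when nothing matches, so mask the status.
  grep -c '<activation>' "$1" || true
}
```

If the behaviour described above holds, running it against the 
dependency-reduced pom of the shaded mapreduce artifact should print 0 even 
though profiles are still listed in that pom.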

To see the actual dependency tree/list for use with a particular Hadoop 
version, you have to manually activate the relevant profile, e.g. 
{{mvn -Phadoop-2.0 dependency:tree -f /path/to/maven/repo/org/apache/hbase/hbase-shaded-mapreduce/3.0.0-SNAPSHOT/hbase-shaded-mapreduce-3.0.0-SNAPSHOT.pom}}.

I think this is fine: our stated intended usage is via the Hadoop commands, so 
the vast majority of users will not programmatically inspect the pom to work 
out which specific jars to get from the environment.

-v0
 * modify the jar checking script to take args; make hadoop stuff optional
 * separate out checking the artifacts that have hadoop vs those that don't.
 ** Unfortunately means we need two modules for checking things
 ** put in a safety check that the support script for checking jar contents is 
maintained in both modules
 * move hadoop deps for the mapreduce module to provided scope; we should be 
getting stuff from hadoop at runtime for the non-shaded artifact as well.
 ** have to carve out an exception for o.a.hadoop.metrics2. :(
 * fix duplicated class warning
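
To illustrate the "take args; make hadoop stuff optional" idea from the list 
above, here is a minimal sketch. It is not the script from the patch: the 
function name, the {{--allow-hadoop}} flag, and the allowed prefixes are all 
assumptions chosen for illustration, and it ignores the o.a.hadoop.metrics2 
carve-out.

```shell
# Hypothetical sketch, not the script from the patch.
# Reads jar entry names on stdin (one per line) and prints every entry that
# falls outside the allowed prefixes.
# "--allow-hadoop" is an illustrative flag name: when given, anything under
# org/apache/hadoop/ is accepted, not only org/apache/hadoop/hbase/.
find_disallowed_entries() {
  local allowed='^(META-INF/|org/apache/hadoop/hbase/)'
  if [ "${1:-}" = "--allow-hadoop" ]; then
    allowed='^(META-INF/|org/apache/hadoop/)'
  fi
  # Drop directory entries (names ending in "/"), then report what is left
  # that does not match an allowed prefix. "|| true" keeps an empty result
  # from being treated as a failure.
  grep -v '/$' | grep -Ev "${allowed}" || true
}
```

Usage would look something like 
{{jar tf /path/to/artifact.jar | find_disallowed_entries}} for an artifact 
that must not contain hadoop, and the same pipeline with {{--allow-hadoop}} 
for one that may. Empty output means nothing unexpected was found.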

> shaded mapreduce module shouldn't include hadoop
> ------------------------------------------------
>
>                 Key: HBASE-20332
>                 URL: https://issues.apache.org/jira/browse/HBASE-20332
>             Project: HBase
>          Issue Type: Sub-task
>          Components: mapreduce, shading
>    Affects Versions: 2.0.0
>            Reporter: Sean Busbey
>            Assignee: Sean Busbey
>            Priority: Critical
>             Fix For: 2.0.0
>
>         Attachments: HBASE-20332.0.patch
>
>
> AFAICT, we should just entirely skip including hadoop in our shaded mapreduce 
> module
> 1) Folks expect to run yarn / mr apps via {{hadoop jar}} / {{yarn jar}}
> 2) those commands include all the needed Hadoop jars in your classpath by 
> default (both client side and in the containers)
> 3) If you try to use "user classpath first" for your job as a workaround 
> (e.g. for some library your application needs that hadoop provides) then our 
> inclusion of *some but not all* hadoop classes then causes everything to fall 
> over because of mixing rewritten and non-rewritten hadoop classes
> 4) if you don't use "user classpath first" then all of our 
> non-relocated-but-still-shaded hadoop classes are ignored anyways so we're 
> just wasting space



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
