[ 
https://issues.apache.org/jira/browse/HBASE-26909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault updated HBASE-26909:
--------------------------------------
    Assignee: Bryan Beaudreault
      Labels: patch-available  (was: )
      Status: Patch Available  (was: Open)

I did this a little backwards: the attached PR is for branch-2. I have a 
virtually identical PR ready for master as well, but don't want to confuse 
things by submitting it early. I can submit it once we're ready.

In the end I decided not to tackle the hadoop dependency issue noted above. 
I may file a separate Jira for that later, as I think it will be really 
tricky.

> hbase-shaded-mapreduce and hbase-shaded-client expose some of the same classes
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-26909
>                 URL: https://issues.apache.org/jira/browse/HBASE-26909
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Major
>              Labels: patch-available
>
> We supply 2 primary artifacts for end-users to consume:
>  * hbase-shaded-client, which is for general use
>  * hbase-shaded-mapreduce, which is for use when you need to connect to hbase 
> via mapreduce, for example with TableInputFormat
> The problem is that these artifacts expose tons of duplicate classes. One 
> example (among many) is org.apache.hadoop.hbase.Cell, which appears in both 
> jars.
> This may not be a problem if your projects are always very isolated, either 
> doing mapreduce or not. In that case you just depend on the one you need. 
> But many users work in much more complicated environments where 
> dependencies tend to bleed across projects. Here's an 
> illustration:
>  * Imagine a project FooService, which includes two modules FooServiceRestWeb 
> (for the rest http resources) and FooServiceData (which includes DAOs for 
> accessing data). FooServiceRestWeb depends on FooServiceData to access hbase. 
>  In this case, FooServiceData should depend on hbase-shaded-client.
>  * Now imagine another project FooPipeline, which has modules 
> FooPipelineHadoop (with M/R jobs for processing data) and FooPipelineData 
> (which has some DAOs for accessing data). In this case, FooPipelineData might 
> depend on hbase-shaded-mapreduce since the context is intended for M/R.
>  * The problem arises when suddenly we want to include some data from 
> FooService into our pipeline. The most straightforward way to achieve this is 
> by depending on FooServiceData, which has all of the DAOs for that data but 
> also depends on hbase-shaded-client. At this point you have a problem, 
> because FooPipelineHadoop now depends on both hbase-shaded-mapreduce and 
> hbase-shaded-client.
> (Note, this obviously skirts around potential microservice solutions like 
> only accessing FooService's data through the API... it's just for 
> illustration, and it does come up.)
> From a plain java perspective, having these 2 jars on the classpath is 
> somewhat wasteful but not a huge issue since the implementations are all the 
> same.
> From a maven perspective, it's problematic because the maven dependency 
> plugin will complain about the conflicting classes.
> One potential fix is to add exclusions to the FooServiceData dependency, to 
> avoid pulling in hbase-shaded-client. This works on a one-off basis but is 
> much more painful in a large and complicated environment where this may come 
> up hundreds of times.
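> As a rough sketch of that exclusion, the pom.xml of the pipeline module that 
> pulls in FooServiceData might look something like the fragment below. The 
> com.example groupId and the 1.0.0 version are hypothetical placeholders for 
> illustration; only the org.apache.hbase:hbase-shaded-client coordinates refer 
> to the real artifact.
>
> ```xml
> <dependency>
>   <!-- hypothetical coordinates for the FooServiceData module -->
>   <groupId>com.example</groupId>
>   <artifactId>FooServiceData</artifactId>
>   <version>1.0.0</version>
>   <exclusions>
>     <!-- keep hbase-shaded-client off the classpath; the needed classes
>          are expected to come from hbase-shaded-mapreduce instead -->
>     <exclusion>
>       <groupId>org.apache.hbase</groupId>
>       <artifactId>hbase-shaded-client</artifactId>
>     </exclusion>
>   </exclusions>
> </dependency>
> ```
>
> Every module that consumes FooServiceData alongside hbase-shaded-mapreduce 
> would need to repeat the same exclusion.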
> A better fix in my opinion is to make hbase-shaded-mapreduce depend on 
> hbase-shaded-client and then only expose the classes that aren't already 
> exposed by the shaded client.
> [~busbey] also mentioned a BOM being a potential solution, but I don't have 
> experience with that.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
