[ 
https://issues.apache.org/jira/browse/HBASE-26909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault updated HBASE-26909:
--------------------------------------
    Description: 
We supply 2 primary artifacts for end-users to consume:
 * hbase-shaded-client, which is for general use
 * hbase-shaded-mapreduce, which is for use when you need to connect to hbase 
via mapreduce. For example, TableInputFormat

The problem is that these artifacts expose tons of duplicate classes. One 
example (among many) is org.apache.hadoop.hbase.Cell, which appears in both 
jars.
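
As a rough way to see the overlap, the class entries of the two jars can be 
compared directly. This is only a sketch and not part of HBase: the helper 
names and the jar paths in the usage note are hypothetical, and it relies 
solely on the fact that a jar is a zip archive.

```python
import zipfile


def class_entries(jar_path):
    """Return the set of .class entry names in a jar (a jar is a zip archive)."""
    with zipfile.ZipFile(jar_path) as jar:
        return {name for name in jar.namelist() if name.endswith(".class")}


def duplicate_classes(jar_a, jar_b):
    """Return the sorted .class entries that appear in both jars."""
    return sorted(class_entries(jar_a) & class_entries(jar_b))
```

Called as duplicate_classes("hbase-shaded-client.jar", 
"hbase-shaded-mapreduce.jar") (hypothetical local paths), it would list 
entries such as org/apache/hadoop/hbase/Cell.class if both jars contain them.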

This may not be a problem if your projects are always very isolated – either 
doing mapreduce, or not. In that case you just depend on the one you need. Many 
users, however, work in much more complicated environments where dependencies 
tend to bleed between projects. Here's an illustration:
 * Imagine a project FooService, which includes two modules FooServiceRestWeb 
(for the rest http resources) and FooServiceData (which includes DAOs for 
accessing data). FooServiceRestWeb depends on FooServiceData to access hbase.  
In this case, FooServiceData should depend on hbase-shaded-client.
 * Now imagine another project FooPipeline, which has modules FooPipelineHadoop 
(with M/R jobs for processing data) and FooPipelineData (which has some DAOs 
for accessing data). In this case, FooPipelineData might depend on 
hbase-shaded-mapreduce since the context is intended for M/R.
 * The problem arises when suddenly we want to include some data from 
FooService into our pipeline. The most straightforward way to achieve this is 
by depending on FooServiceData, which has all of the DAOs for that data but 
also depends on hbase-shaded-client. At this point you have a problem, because 
FooPipelineHadoop now depends on both hbase-shaded-mapreduce and 
hbase-shaded-client.
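
The scenario above can be sketched as a small dependency graph. This is a 
simplified model, not how Maven actually resolves dependencies (it ignores 
scopes, versions, and exclusions). The function name is hypothetical, and two 
edges are assumptions: that FooPipelineHadoop depends on FooPipelineData, and 
that the new FooServiceData dependency is declared on FooPipelineHadoop.

```python
def transitive_deps(graph, root):
    """Collect every module or artifact reachable from root."""
    seen = set()
    stack = list(graph.get(root, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


# Dependency edges from the illustration; the two FooPipelineHadoop edges
# are assumptions made for this sketch.
graph = {
    "FooServiceRestWeb": ["FooServiceData"],
    "FooServiceData": ["hbase-shaded-client"],
    "FooPipelineHadoop": ["FooPipelineData", "FooServiceData"],
    "FooPipelineData": ["hbase-shaded-mapreduce"],
}

shaded = {"hbase-shaded-client", "hbase-shaded-mapreduce"}
# Shaded artifacts that end up reachable from FooPipelineHadoop.
conflict = sorted(transitive_deps(graph, "FooPipelineHadoop") & shaded)
```

Here conflict contains both shaded artifacts, while the same check on 
FooServiceRestWeb finds only hbase-shaded-client.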

(Note, this obviously skirts around potential microservice solutions like only 
accessing FooService's data through the API... it's just for illustration, and 
it does come up.)

From a plain java perspective, having these 2 jars on the classpath is 
somewhat wasteful but not a huge issue since the implementations are all the 
same.

From a maven perspective, it's problematic because the maven dependency plugin 
will complain about the conflicting classes.

One potential fix is to add exclusions to the FooServiceData dependency, to 
avoid pulling in hbase-shaded-client. This works on a one-off basis but is much 
more painful in a large and complicated environment where this may come up 
hundreds of times.

A better fix in my opinion is to make hbase-shaded-mapreduce depend on 
hbase-shaded-client and then only expose the classes that aren't already 
exposed by the shaded client.

[~busbey] also mentioned a BOM being a potential solution, but I don't have 
experience with that.

 

  was:
We supply 2 primary artifacts for end-users to consume:
 * hbase-shaded-client, which is for general use
 * hbase-shaded-mapreduce, which is for use when you need to connect to hbase 
via mapreduce. For example, TableInputFormat

The problem is that these artifacts expose tons of duplicate classes. One 
example (among many) is org.apache.hadoop.hbase.Cell, which appears in both 
jars.

This may not be a problem if your projects are always very isolated – either 
doing mapreduce, or not. In that case you just depend on the one you need. Many 
users, however, work in much more complicated environments where dependencies 
tend to bleed between projects. Here's an illustration:

Imagine a project FooService, which includes two modules FooServiceRestWeb (for 
the rest http resources) and FooServiceData (which includes DAOs for accessing 
data). FooServiceRestWeb depends on FooServiceData to access hbase.  In this 
case, FooServiceData should depend on hbase-shaded-client.

Now imagine another project FooPipeline, which has modules FooPipelineHadoop 
(with M/R jobs for processing data) and FooPipelineData (which has some DAOs 
for accessing data). In this case, FooPipelineData might depend on 
hbase-shaded-mapreduce since the context is intended for M/R.

The problem arises when suddenly we want to include some data from FooService 
into our pipeline. The most straightforward way to achieve this is by depending 
on FooServiceData, which has all of the DAOs for that data but also depends on 
hbase-shaded-client. At this point you have a problem, because 
FooPipelineHadoop now depends on both hbase-shaded-mapreduce and 
hbase-shaded-client.

(Note, this obviously skirts around potential microservice solutions like only 
accessing FooService's data through the API... it's just for illustration, and 
it does come up.)

From a plain java perspective, having these 2 jars on the classpath is 
somewhat wasteful but not a huge issue since the implementations are all the 
same.

From a maven perspective, it's problematic because the maven dependency plugin 
will complain about the conflicting classes.

One potential fix is to add exclusions to the FooServiceData dependency, to 
avoid pulling in hbase-shaded-client. This works on a one-off basis but is much 
more painful in a large and complicated environment where this may come up 
hundreds of times.

A better fix in my opinion is to make hbase-shaded-mapreduce depend on 
hbase-shaded-client and then only expose the classes that aren't already 
exposed by the shaded client.

[~busbey] also mentioned a BOM being a potential solution, but I don't have 
experience with that.

 


> hbase-shaded-mapreduce and hbase-shaded-client expose some of the same classes
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-26909
>                 URL: https://issues.apache.org/jira/browse/HBASE-26909
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Bryan Beaudreault
>            Priority: Major
>
> We supply 2 primary artifacts for end-users to consume:
>  * hbase-shaded-client, which is for general use
>  * hbase-shaded-mapreduce, which is for use when you need to connect to hbase 
> via mapreduce. For example, TableInputFormat
> The problem is that these artifacts expose tons of duplicate classes. One 
> example (among many) is org.apache.hadoop.hbase.Cell, which appears in both 
> jars.
> This may not be a problem if your projects are always very isolated – either 
> doing mapreduce, or not. In that case you just depend on the one you need. 
> Many users, however, work in much more complicated environments where 
> dependencies tend to bleed between projects. Here's an 
> illustration:
>  * Imagine a project FooService, which includes two modules FooServiceRestWeb 
> (for the rest http resources) and FooServiceData (which includes DAOs for 
> accessing data). FooServiceRestWeb depends on FooServiceData to access hbase. 
>  In this case, FooServiceData should depend on hbase-shaded-client.
>  * Now imagine another project FooPipeline, which has modules 
> FooPipelineHadoop (with M/R jobs for processing data) and FooPipelineData 
> (which has some DAOs for accessing data). In this case, FooPipelineData might 
> depend on hbase-shaded-mapreduce since the context is intended for M/R.
>  * The problem arises when suddenly we want to include some data from 
> FooService into our pipeline. The most straightforward way to achieve this is 
> by depending on FooServiceData, which has all of the DAOs for that data but 
> also depends on hbase-shaded-client. At this point you have a problem, 
> because FooPipelineHadoop now depends on both hbase-shaded-mapreduce and 
> hbase-shaded-client.
> (Note, this obviously skirts around potential microservice solutions like 
> only accessing FooService's data through the API... it's just for 
> illustration, and it does come up.)
> From a plain java perspective, having these 2 jars on the classpath is 
> somewhat wasteful but not a huge issue since the implementations are all the 
> same.
> From a maven perspective, it's problematic because the maven dependency 
> plugin will complain about the conflicting classes.
> One potential fix is to add exclusions to the FooServiceData dependency, to 
> avoid pulling in hbase-shaded-client. This works on a one-off basis but is 
> much more painful in a large and complicated environment where this may come 
> up hundreds of times.
> A better fix in my opinion is to make hbase-shaded-mapreduce depend on 
> hbase-shaded-client and then only expose the classes that aren't already 
> exposed by the shaded client.
> [~busbey] also mentioned a BOM being a potential solution, but I don't have 
> experience with that.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
