[ https://issues.apache.org/jira/browse/HBASE-26909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bryan Beaudreault updated HBASE-26909:
--------------------------------------
    Assignee: Bryan Beaudreault
      Labels: patch-available  (was: )
      Status: Patch Available  (was: Open)

I did this a little backwards: the attached PR is for branch-2. I have a virtually identical PR ready for master as well, but don't want to confuse things by submitting it early. I can submit it once we're ready.

In the end I decided not to tackle the hadoop dependency issue noted above. I may file a separate Jira for that later, as I think it will be really tricky.

> hbase-shaded-mapreduce and hbase-shaded-client expose some of the same classes
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-26909
>                 URL: https://issues.apache.org/jira/browse/HBASE-26909
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Major
>              Labels: patch-available
>
> We supply 2 primary artifacts for end-users to consume:
> * hbase-shaded-client, which is for general use
> * hbase-shaded-mapreduce, which is for use when you need to connect to hbase
> via mapreduce. For example, TableInputFormat
> The problem is that these artifacts expose tons of duplicate classes. One
> example (among many) is org.apache.hadoop.hbase.Cell, which appears in both
> jars.
> This may not be a problem if your projects are always very isolated: either
> doing mapreduce, or not. In that case you just depend on the one you need.
> Many users, however, work in much more complicated environments where
> dependencies tend to bleed between projects. Here's an illustration:
> * Imagine a project FooService, which includes two modules FooServiceRestWeb
> (for the REST HTTP resources) and FooServiceData (which includes DAOs for
> accessing data). FooServiceRestWeb depends on FooServiceData to access hbase.
> In this case, FooServiceData should depend on hbase-shaded-client.
> * Now imagine another project FooPipeline, which has modules
> FooPipelineHadoop (with M/R jobs for processing data) and FooPipelineData
> (which has some DAOs for accessing data). In this case, FooPipelineData might
> depend on hbase-shaded-mapreduce since the context is intended for M/R.
> * The problem arises when suddenly we want to include some data from
> FooService into our pipeline. The most straightforward way to achieve this is
> by depending on FooServiceData, which has all of the DAOs for that data but
> also depends on hbase-shaded-client. At this point you have a problem,
> because FooPipelineHadoop now depends on both hbase-shaded-mapreduce and
> hbase-shaded-client.
> (Note, this obviously skirts around potential microservice solutions like
> only accessing FooService's data through the API... it's just for
> illustration, and it does come up.)
> From a plain java perspective, having these 2 jars on the classpath is
> somewhat wasteful but not a huge issue since the implementations are all the
> same.
> From a maven perspective, it's problematic because the maven dependency
> plugin will complain about the conflicting classes.
> One potential fix is to add exclusions to the FooServiceData dependency, to
> avoid pulling in hbase-shaded-client. This works on a one-off basis but is
> much more painful in a large and complicated environment where this may come
> up hundreds of times.
> A better fix in my opinion is to make hbase-shaded-mapreduce depend on
> hbase-shaded-client and then only expose the classes that aren't already
> exposed by the shaded client.
> [~busbey] also mentioned a BOM being a potential solution, but I don't have
> experience with that.
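
For anyone who wants to see the overlap described in the issue on their own builds, here is a rough sketch that lists the class entries present in both jars. It is not part of the attached PR and not how the maven dependency plugin does its check; it only relies on a jar being a zip archive. The function names are made up for illustration, and the jar paths are whatever local copies of the two shaded artifacts you have.

```python
import zipfile


def class_entries(jar_path):
    """Return the set of .class entry names in a jar (a jar is a zip archive)."""
    with zipfile.ZipFile(jar_path) as jar:
        return {name for name in jar.namelist() if name.endswith(".class")}


def duplicate_classes(jar_a, jar_b):
    """Return the sorted .class entries that appear in both jars."""
    # Only entry names are compared here, not the class bytes.
    return sorted(class_entries(jar_a) & class_entries(jar_b))
```

Calling duplicate_classes() with the paths to the hbase-shaded-client and hbase-shaded-mapreduce jars should return the shared entries; per the description above, one would expect org/apache/hadoop/hbase/Cell.class to be among them.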