Hi!

This was originally discussed back in 2021, but changes in priorities meant
that no progress has been made on this until now, when Aron picked up and
polished the patch.

We are tracking this work in *PHOENIX-6053*.

I have tried to give a (not so) quick summary of the change below, but the
original discussion thread has more details, and I suggest reading that as
well:
https://lists.apache.org/thread/hs4klbc04n4gh62z17pznc0rkspjg6jx



*Motivation:*
The huge number of dependencies that Phoenix pulls in is an ongoing problem.
To use the thick client, you either depend on it via Maven, which brings in
dozens of large, complex and commonly used dependencies, or you use the
shaded phoenix-client artifact, which includes every dependency and
attempts to shade everything that can be shaded.
When going the unshaded route, you need to make sure that your application
works with Phoenix's versions of its dependencies, or that your
application's versions don't break Phoenix.

When using the shaded artifact, this is less of an issue, but there are
still cases where shading doesn't help, or causes additional problems.
One such issue is that you cannot have any Hadoop or HBase libraries on the
classpath, as things fail hard when shaded and unshaded (or at least not
Phoenix-shaded) jars are mixed.
Another issue is https://issues.apache.org/jira/browse/PHOENIX-6861, where
a shading change broke PQS.

The direct motivation for us was a project that needed to use Phoenix
along with other Hadoop stack components, where we couldn't use
phoenix-client because of the shading conflicts,
and we couldn't use phoenix-core because of a protobuf (2.5.0) version
conflict.


*Proposed solution:*

*STEP 1 (The current patch):*
Split the current phoenix-core module into two parts:
*phoenix-core* retains all the code that is needed for the thick client,
and excludes everything that uses either the hbase-server or mapreduce APIs.
*phoenix-coprocessors* includes everything that is not needed for the thick
client (i.e. server-side code), and/or depends on hbase-server or the
mapreduce libraries.
phoenix-coprocessors of course depends on phoenix-core, and both
*phoenix-client* and *phoenix-server* depend on both.
This is of course easier said than done, as there are a lot of circular
dependencies between these modules, which need to be broken.
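To illustrate the intended dependency direction after the split, here is a
minimal sketch of what the phoenix-coprocessors POM might look like. The
artifact names come from this proposal; the group IDs, layout and everything
else are just illustrative, not actual POM contents:

```xml
<!-- Hypothetical sketch of the phoenix-coprocessors POM after the split.
     Module names are from the proposal; details are illustrative only. -->
<project>
  <groupId>org.apache.phoenix</groupId>
  <artifactId>phoenix-coprocessors</artifactId>
  <dependencies>
    <!-- server-side code builds on the slimmed-down client core -->
    <dependency>
      <groupId>org.apache.phoenix</groupId>
      <artifactId>phoenix-core</artifactId>
    </dependency>
    <!-- the heavy dependencies that phoenix-core would no longer pull in -->
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-server</artifactId>
    </dependency>
  </dependencies>
</project>
```

The key point is that the dependency arrow only goes one way: phoenix-core
must build without any reference to phoenix-coprocessors, hbase-server, or
the mapreduce libraries.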

*STEP 2 (Future patch):*
Introduce a new artifact, *phoenix-client-lite*, which includes neither the
*phoenix-coprocessors* code nor its dependencies (*hbase-server*,
*mapreduce*).
This is mostly a size optimization; last I checked it shaves ~30 megabytes
off the current phoenix-client jar size. This would be the one most
"normal" applications, including phoenix-sqlline,
would add to their classpath.

Introduce a new artifact, *phoenix-client-byo-hbase*, which is modelled
after *hbase-client-byo-hadoop*. This one includes phoenix-core and its
direct non-HBase dependencies,
but uses HBase (and Hadoop/MR) from the *hbase-client* or
*hbase-client-byo-hadoop* jars. We need to make some changes to shading to
cover the differences between the standard HBase API and the shaded
hbase-client API.
This solves the Hadoop/HBase coexistence problem.
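From an application's point of view, the byo-hbase arrangement might look
roughly like this (a sketch assuming the proposed artifact name; these are
not released coordinates):

```xml
<!-- Hypothetical application POM: Phoenix without bundled HBase/Hadoop.
     The application supplies HBase itself via hbase-client (or
     hbase-client-byo-hadoop), avoiding shaded/unshaded jar mixing. -->
<dependencies>
  <dependency>
    <groupId>org.apache.phoenix</groupId>
    <artifactId>phoenix-client-byo-hbase</artifactId>
  </dependency>
  <dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
  </dependency>
</dependencies>
```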

*STEP 3 (Future patch):*
To solve the various classpath issues, the current connectors use custom
shading.
The Spark connector in particular needs to coexist with hbase-client on the
Spark classpath, and requires the same shading changes that
*phoenix-client-byo-hbase* does.
The Hive connector ultimately needs to get the same treatment (see
PHOENIX-6939; we've done that downstream a long time ago).
Instead of the current custom shaded 70-80MB JARs, these could be a few
dozen kilobytes, only containing the actual connector code and depending on
*phoenix-client-byo-hbase*. Unfortunately, this is complicated by the
Phoenix mapreduce code depending on hbase-server.



*Open questions:*

*Pacing:*
Does the three-step plan above make sense, or should we split it further,
or consolidate it into one or two steps?
Perhaps doing STEP 1 and 2 in one patch would let us test whether the new
artifacts indeed work as expected before committing the changes.


*Naming:*
Both the module names and the package names introduced are up for
debate.
For now, we've renamed org.apache.phoenix.coprocessor to
org.apache.phoenix.coprocessorclient on the client side, to avoid having a
package named "coprocessor" in the client,
and we had to invent new names for some classes, but we are open to any
ideas.
As for the modules, *phoenix-client* and *phoenix-server* are already
taken, so we went with *phoenix-coprocessors*, but
better names are welcome, either for these or for the new shaded artifacts.


*Mapreduce:*
My original idea was to split the Phoenix mapreduce code into a
third module, so that connectors could depend only on that one
and not on the phoenix-coprocessors module.
However, the snapshot handling depends on hbase-server, and IIRC there were
some other non-trivial dependencies on the
server-side code in them.
My current thought is that a separate mapreduce module is not needed.
The Phoenix mapreduce jobs can be run with just the hbase command, which
adds hbase-server to the classpath anyway.
The connectors deal directly with HFiles, so they are not expected to
run outside the cluster, and we can just add the extra dependencies to the
classpath, i.e.:
*hbase-client* (coming from Spark/Hive), *phoenix-client-byo-hbase*,
*phoenix-coprocessors*, *hbase-server*, and the minimal connector JAR.
I would especially welcome insight on this one.


*5.1:*
My original plan was to leave 5.1 alone. However, as 5.1 is shaping
up to be a longer-term supported version (at least in CLDR),
I increasingly find myself warming up to the idea of backporting these
changes.
First, it would make later backports much less of a pain.
Second, it doesn't break any public (or semi-public) API or change
behaviour; we're just moving internal code around.
Third, we could make the new feature available faster, in 5.1.4 or 5.1.5.

Looking forward to your feedback, either on the above topics or in general.
Josh, Jacob, Daniel and Lars contributed to the previous discussion,
and I hope to receive input both from them and from everyone else who
hasn't participated yet.

Best regards
Istvan
