(The dependency PDF is attached to the ticket, please see it there.)

The separation is coming along nicely: I was able to run a successful e2e
test with the separated branch and the Spark connector.

This got me thinking about how we handle the dependent projects (connectors
and queryserver), and what would be the best way to go forward.

The use cases I think we need to cover:

- HBase server components for HBase classpath (no change)
- Standalone JDBC driver (for sqlline, gui clients, and apps that do not
include other components of the Hadoop stack)
- JDBC driver that works with existing HBase client on classpath
- JDBC driver that works with existing shaded HBase client on classpath

- mapreduce functionality for connectors that works with unshaded HBase
- mapreduce functionality for connectors that works with shaded HBase

- running the Phoenix mapreduce tools

Updated proposal for shaded phoenix artifacts to generate:

*phoenix-server*: (no change) phoenix-client, phoenix-mapreduce, and
phoenix-connectors, with their own dependencies, but without the hadoop and
hbase dependencies and phoenix-hbase-compat.

*phoenix-client*: (no change) phoenix-client, phoenix-mapreduce, and
phoenix-connectors, with all dependencies, including logging and
phoenix-hbase-compat.
I don't see the use case for it, but we may need to keep it for backwards
compatibility, or use this name for phoenix-client-lite.

*phoenix-client-embedded*: (no change) phoenix-client, phoenix-mapreduce,
and phoenix-connectors, with all dependencies, except logging and
phoenix-hbase-compat.
I don't see the use case for it, but we may need to keep it for backwards
compatibility, or use this name for phoenix-client-lite-embedded.

*phoenix-client-lite-embedded*: phoenix-core, and all of its dependencies,
without the slf4j-log4j backend and log4j libraries, and without
phoenix-hbase-compat.
For connectors, and applications where slf4j is already used and set up.
The benefit of this artifact is also somewhat dubious. In most cases it can
be replaced with phoenix-client-byo-hbase + hbase-shaded-client +
phoenix-hbase-compat-X.

*phoenix-client-lite*: phoenix-core, and all of its dependencies including
the hbase and hadoop client jars, with the slf4j-log4j backend and log4j
libraries, and phoenix-hbase-compat.
For GUI clients, and applications that do not have their own logging setup,
or do not use slf4j.

*phoenix-client-byo-hbase*: phoenix-client-lite-embedded, without the hbase
and hadoop dependencies.
For connectors where Hadoop and HBase are already on the classpath, or apps
that also use HBase/Hadoop directly.

*phoenix-client-byo-shaded-hbase*: phoenix-client-byo-hbase, with the
necessary relocations to work with hbase-shaded-client.jar.
For connectors where Hadoop and shaded HBase are already on the classpath,
or apps that also use (shaded) HBase directly.

*phoenix-mapreduce-for-shaded-hbase*: phoenix-mapreduce, with some
relocations to work with hbase-shaded-mapreduce.jar. Does not include any
dependencies.
Standard phoenix-mapreduce depends on hbase-mapreduce (plus dependencies),
and phoenix-client-byo-hbase.
If you use phoenix-client-byo-shaded-hbase, then you need the same
relocations in phoenix-mapreduce as you do in
phoenix-client-byo-shaded-hbase to handle the API differences.
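
To illustrate what "relocation" means here, the following is a minimal
Python sketch of the class-name rewriting that a shading step performs. It
is not how the build does it (that is the job of the Maven shade plugin).
The `RELOCATIONS` table and the `relocate` helper are hypothetical, and the
target prefix is an assumption chosen for illustration, so check the real
hbase-shaded-client jar for the actual package names.

```python
# Hypothetical relocation table: original package prefix -> relocated prefix.
# The target prefix is an ASSUMPTION for illustration only.
RELOCATIONS = {
    "com.google.protobuf.": "org.apache.hadoop.hbase.shaded.com.google.protobuf.",
}


def relocate(class_name, relocations=RELOCATIONS):
    """Return the class name as it would appear after relocation."""
    for old_prefix, new_prefix in relocations.items():
        if class_name.startswith(old_prefix):
            return new_prefix + class_name[len(old_prefix):]
    # Classes outside the relocated packages keep their original name.
    return class_name


if __name__ == "__main__":
    print(relocate("com.google.protobuf.Service"))
    print(relocate("org.apache.phoenix.jdbc.PhoenixDriver"))
```

The point of the sketch is that code compiled against the unrelocated names
refers to different classes than code compiled against the relocated ones,
which is why the client and the mapreduce artifact need matching relocations.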

Note that in this proposal only *phoenix-client-lite*,
*phoenix-client-lite-embedded*, and the legacy *phoenix-client* and
*phoenix-client-embedded* (the artifacts that include Hadoop+HBase)
would include phoenix-hbase-compat, and would have binaries pre-built for
all hbase profiles.
For all other use cases, including phoenix-server, the necessary
phoenix-hbase-compat-X jar would need to be explicitly added.
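
As a rough summary of the new client artifacts, here is a minimal Python
sketch of how they differ and which one a given consumer would pick. It only
models the proposal and is not build code. The `ClientArtifact` type, its
field names, and `pick_client` are hypothetical, and the flags are my reading
of the descriptions above. It also assumes that anything that already has
HBase on the classpath has its own logging setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientArtifact:
    name: str
    bundles_hbase_and_hadoop: bool
    bundles_logging_backend: bool
    relocated_for_shaded_hbase: bool


# The four new client artifacts from the proposal (legacy clients omitted).
ARTIFACTS = [
    ClientArtifact("phoenix-client-lite", True, True, False),
    ClientArtifact("phoenix-client-lite-embedded", True, False, False),
    ClientArtifact("phoenix-client-byo-hbase", False, False, False),
    ClientArtifact("phoenix-client-byo-shaded-hbase", False, False, True),
]


def pick_client(hbase_on_classpath, logging_already_set_up):
    """hbase_on_classpath is one of "none", "unshaded" or "shaded"."""
    if hbase_on_classpath not in ("none", "unshaded", "shaded"):
        raise ValueError(f"unknown classpath kind: {hbase_on_classpath}")
    want_bundled = hbase_on_classpath == "none"
    want_relocated = hbase_on_classpath == "shaded"
    # Only the standalone case may need a bundled logging backend.
    want_logging = want_bundled and not logging_already_set_up
    wanted = (want_bundled, want_logging, want_relocated)
    for artifact in ARTIFACTS:
        flags = (
            artifact.bundles_hbase_and_hadoop,
            artifact.bundles_logging_backend,
            artifact.relocated_for_shaded_hbase,
        )
        if flags == wanted:
            return artifact.name
    raise LookupError(f"no artifact matches {wanted}")


if __name__ == "__main__":
    print(pick_client("none", False))    # standalone GUI client
    print(pick_client("shaded", True))   # e.g. on top of the HBase Spark connector
```

For example, a standalone GUI client without its own logging setup ends up
with phoenix-client-lite, while something running next to shaded HBase ends up
with phoenix-client-byo-shaded-hbase.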

It is also possible to leave every existing shaded artifact as is, and
simply add the new ones, which would minimize the disruption to existing
workflows/scripts, but would further inflate the assembly size and maven
artifacts.

I also think that we should stop building shaded connector artifacts, and
instead document which phoenix client variant, and what additional
dependencies, are needed to get each connector working (see below).


*QueryServer:*
As a standalone process, without the requirement to work with other
components, it can run with almost anything.
The legacy phoenix-client, phoenix-client-lite, or even phoenix-core is
fine, as long as we add the necessary dependencies to its classpath.

*Spark:*
Spark already has Hadoop on the classpath, but it doesn't include HBase.
The official HBase connector depends on hbase-shaded-mapreduce.
To interoperate with this, we need something that can work on top of the
HBase connector and its dependencies:
*phoenix-client-byo-shaded-hbase + phoenix-hbase-compat +
phoenix-mapreduce-for-shaded-hbase + phoenix5-spark*

*Hive:*
Hive is a big headache, because as of 3.x it includes an ancient HBase
version by default, which conflicts with everything.
The issue (and a partial fix for Hive 4.0) is detailed in HIVE-24473.
Assuming that the problem is somehow fixed
(by patching and recompiling Hive, or by manually replacing the HBase
libraries), we have both Hadoop and the non-shaded HBase libraries on the
classpath.
In this case the Phoenix connector would consist of:
*phoenix-client-byo-hbase + phoenix-hbase-compat + phoenix-mapreduce +
phoenix5-hive*

*Kafka:*
The Kafka connector is so hopelessly out of date that it's not even worth
considering now.

*Flume:*
I haven't got the foggiest idea how Flume composes its components, how
the classpath is built, or even whether the components run in the same JVM.
In any case, one of the following should work:
- *phoenix-client-lite-embedded + phoenix-mapreduce + phoenix5-flume*
- *phoenix-client-byo-hbase + phoenix-hbase-compat + phoenix-mapreduce +
phoenix5-flume*
- *phoenix-client-byo-shaded-hbase + phoenix-hbase-compat +
phoenix-mapreduce-for-shaded-hbase + phoenix5-flume*

*Trino:*
There are people here who actually understand Trino and its connector
classpath, unlike me.
I get the impression that Trino doesn't use the phoenix mapreduce
functionality, so I expect that
*phoenix-client-lite-embedded + phoenix-hbase-compat* should work.
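
To show the kind of documentation I have in mind instead of shaded connector
jars, here is a minimal Python sketch that records the jar sets listed above
and checks the one hard rule from the proposal: the shaded-HBase client must
be paired with the shaded-HBase mapreduce artifact, and the other clients must
not be. The `CONNECTOR_JARS` table and `pairing_is_consistent` are
hypothetical names, and the artifact names are the proposed ones without
version or HBase profile suffixes.

```python
SHADED_CLIENT = "phoenix-client-byo-shaded-hbase"
SHADED_MR = "phoenix-mapreduce-for-shaded-hbase"
PLAIN_MR = "phoenix-mapreduce"

# Jar sets per connector, as listed in this mail (Kafka and Flume left out).
CONNECTOR_JARS = {
    "spark": [SHADED_CLIENT, "phoenix-hbase-compat", SHADED_MR, "phoenix5-spark"],
    "hive": ["phoenix-client-byo-hbase", "phoenix-hbase-compat", PLAIN_MR,
             "phoenix5-hive"],
    "trino": ["phoenix-client-lite-embedded", "phoenix-hbase-compat"],
}


def pairing_is_consistent(jars):
    """Check that the client and mapreduce artifacts use matching relocations."""
    if SHADED_CLIENT in jars:
        # The shaded client cannot be combined with the plain mapreduce jar.
        return PLAIN_MR not in jars
    # Any other client cannot be combined with the relocated mapreduce jar.
    return SHADED_MR not in jars


if __name__ == "__main__":
    for connector, jars in CONNECTOR_JARS.items():
        print(connector, pairing_is_consistent(jars), " + ".join(jars))
```

A check like this could live next to the connector documentation, so that a
newly documented combination is at least internally consistent.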


I hope some of you have made it this far.

I would really like to know your opinion:

Do you know of any additional use cases that we should support?
Do the above artifacts cover every known use case?
Should we keep the legacy clients? Keeping them takes no effort, but they
do take many minutes to build, and hundreds of megabytes to store.
What should the phoenix assembly contain?
Should we replace phoenix-client with phoenix-client-lite in sqlline?
Is the requirement to add phoenix-hbase-compat to the HBase classpath
separately a problem?
Is leaving phoenix-hbase-compat out of the phoenix-server jar a problem
when running mapreduce jobs with the HBase command? (Maybe add a helper
script as suggested by Josh?)
What should the connectors assembly contain? (I think only the
unshaded connector jars.)


regards
Istvan

On Tue, Apr 20, 2021 at 2:46 AM Josh Elser <els...@apache.org> wrote:

> Istvan -- the mailing list stripped your attachment off, I believe :).
>
> IIRC, Istvan's suggestion paves the way to make this (further)
> separation easier. With the changes he's proposing, we could further
> split the common module out into distinct pieces, and reduce what
> phoenix "server" requires.
>
> On 4/18/21 9:13 PM, la...@apache.org wrote:
> > There is also another angle to look at. A long time ago I wrote this:
> >
> > "
> > It seems Phoenix serves 4 distinct purposes:
> > 1. Query parsing and compiling.
> > 2. A type system
> > 3. Query execution
> > 4. Efficient HBase interface
> >
> > Each of these is useful by itself, but we do not expose these as stable
> interfaces.
> > We have seen a lot of need to tie HBase into "higher level" service,
> such as Spark (and Presto, etc).
> > I think we can get a long way if we separate at least #1 (SQL) from the
> rest #2, #3, and #4 (Typed HBase Interface - THI).
> > Phoenix is used via SQL (#1), other tools such as Presto, Impala, Drill,
> Spark, etc, can interface efficiently with HBase via THI (#2, #3, and #4).
> > "
> >
> > I still believe this is an additional useful demarcation for how to
> group the code. It also coincides somewhat with server/client.
> >
> > Query parsing and the type system are client. Query execution and HBase
> interface are both client and server.
> >
> > -- Lars
> >
> > On Wednesday, April 14, 2021, 8:56:08 AM PDT, Istvan Toth <
> st...@apache.org> wrote:
> >
> >
> >
> >
> >
> > Jacob, Josh and I had a discussion about the topic.
> >
> > I'm attaching the dependency graph of the proposed modules
> >
> >
> >
> > On Fri, Apr 9, 2021 at 6:30 AM Istvan Toth <st...@cloudera.com> wrote:
> >> The bulk of the changes I'm working on is indeed the separation of the
> client and the server side code.
> >>
> >> Separating the MR related classes, and the tools-specific code (main,
> options parsing, etc) makes sense to me, if we don't mind adding another
> module.
> >>
> >> In the first WIP iteration, I'm splitting out everything that depends
> on more than hbase-client into a "server" module.
> >> Once that works I will look at splitting that further into a  real
> "server" and an "MR/tools" module.
> >>
> >>
> >> My initial estimates about splitting the server side code were way too
> optimistic, we have to touch a lot of code to break circular dependencies
> between the client and server side. The changes are still quite trivial,
> but the patch is going to be huge and scary.
> >>
> >>
> >> Tests are also going to be a problem, we're probably going to have to
> move most of them into the "server" or a separate "tests" module, as the
> MiniCluster tests depend on code from each module.
> >>
> >> The plan in PHOENIX-5483, and Lars's mail sounds good, but I think that
> it would be more about dividing the "client-side" module further.
> >> (BTW I think that making the indexing engine available separately would
> also be a popular feature )
> >>
> >>
> >>
> >> On Fri, Apr 9, 2021 at 5:39 AM Daniel Wong <dbw...@apache.org> wrote:
> >>> This is another project I am interested in as well as my group at
> >>> Salesforce.  We have had some discussions internally on this but I
> wasn't
> >>> aware of this specific Spark issue (We only allow phoenix access via
> spark
> >>> by default).  I think the approaches outlined are a good initial step
> but
> >>> we were also considering a larger breakup of phoenix-core.  I don't
> >>> think the desire for the larger step should stop us from doing the
> initial
> >>> ones Istvan and Josh proposed.  I think the high level plan makes
> sense
> >>> but I might prefer a different name than phoenix-tools for the ones we
> want
> >>> to be available to external libraries like phoenix-connectors.  Another
> >>> possible alternative is to restructure maybe less invasively by making
> >>> phoenix core like your proposed tools and making a phoenix-internal or
> >>> similar for the future.
> >>> One thing I was wondering was how much effort it was to split
> client/server
> >>> through phoenix-core...  Lars laid out a good component view of
> phoenix
> >>> whose first step might be PHOENIX-5483 but we could focus on highest
> >>> level separation rather than bottom up.  However, even that thread
> linked
> >>> there talks about a client-facing api which we can piggyback for this
> use.
> >>> Say phoenix-public-api or similar.
> >>>
> >>> On Wed, Apr 7, 2021 at 9:43 AM Jacob Isaac <jacobpisaa...@gmail.com>
> wrote:
> >>>
> >>>> Hi Josh & Istvan
> >>>>
> >>>> Thanks Istvan for looking into this, I am also interested in solving
> this
> >>>> problem,
> >>>> Let me know how I can help?
> >>>>
> >>>> Thanks
> >>>> Jacob
> >>>>
> >>>> On Wed, Apr 7, 2021 at 9:05 AM Josh Elser <els...@apache.org> wrote:
> >>>>
> >>>>> Thanks for trying to tackle this sticky problem, Istvan. For the
> context
> >>>>> of everyone else, the real-life problem Istvan is trying to fix is
> that
> >>>>> you cannot run a Spark application with both HBase and Phoenix jars
> on
> >>>>> the classpath.
> >>>>>
> >>>>> If I understand this correctly, it's that the HBase API signatures
> are
> >>>>> different depending on whether we are "client side" or "server side"
> >>>>> (within a RegionServer). Your comment on PHOENIX-6053 shows that
> >>>>> (signatures on Table.java around Protobuf's Service class having
> shaded
> >>>>> relocation vs. the original com.google.protobuf coordinates).
> >>>>>
> >>>>> I think the reason we have the monolithic phoenix-core is that we
> have
> >>>>> so much logic which is executed on both the client and server side.
> For
> >>>>> example, we may push a filter operation to the server-side or we may
> >>>>> run it client-side. That's also why we have the "thin" phoenix-server
> >>>>> Maven module which just re-packages phoenix-core.
> >>>>>
> >>>>> Is it possible that we change phoenix-server so that it contains the
> >>>>> "server-side" code that we don't want to have using the HBase classes
> >>>>> with thirdparty relocations, rather than introduce another new Maven
> >>>>> module?
> >>>>>
> >>>>> Looking through your WIP PR too.
> >>>>>
> >>>>> On 4/7/21 1:10 AM, Istvan Toth wrote:
> >>>>>> Hi!
> >>>>>>
> >>>>>> I've been working on getting Phoenix working with
> >>>>> hbase-shaded-client.jar,
> >>>>>> and I am finally getting traction.
> >>>>>>
> >>>>>> One of the issues that I encountered is that we are mixing client
> and
> >>>>>> server side code in phoenix-core, and there's a
> >>>>>> mutual interdependence between the two.
> >>>>>>
> >>>>>> Fixing this is not hard, as it's mostly about replacing
> >>>> .class.getName()
> >>>>> s
> >>>>>> with string constants, and moving around some inconveniently placed
> >>>>> static
> >>>>>> utility methods, and now I have a WIP version where the client side
> >>>>> doesn't
> >>>>>> depend on server classes.
> >>>>>>
> >>>>>> However, unless we change the project structure, and factor out the
> >>>>> classes
> >>>>>> that depend on server-side APIs, this will be extremely fragile, as
> any
> >>>>>> change can (and will) re-introduce the circular dependency between
> the
> >>>>>> classes.
> >>>>>>
> >>>>>> To solve this issue I propose the following:
> >>>>>>
> >>>>>>       - clean up phoenix-core, so that only classes that depend
> only on
> >>>>>>       *hbase-client* (or at worst only on classes that are present
> in
> >>>>>>       *hbase-shaded-client*) remain. This should be 90+% of the code
> >>>>>>       - move all classes (mostly coprocessors and their support
> code)
> >>>> that
> >>>>> use
> >>>>>>       the server API (*hbase-server* mostly) to a new module, say
> >>>>>>       phoenix-coprocessors (the phoenix-server module name is
> taken).
> >>>> This
> >>>>> new
> >>>>>>       module depends on phoenix-core.
> >>>>>>       - move all classes that directly depend on MapReduce, and
> their
> >>>>> main()
> >>>>>>       classes to the existing phoenix-tools module (which also
> depends on
> >>>>> core)
> >>>>>>
> >>>>>> The separation would be primarily based on API use, at the first cut
> >>>> I'd
> >>>>> be
> >>>>>> fine with keeping all logic in phoenix-core, and referencing that. We
> may
> >>>> or
> >>>>>> may not want to move logic that is only used in coprocessors or
> tools,
> >>>>> but
> >>>>>> doesn't use the respective APIs to the new modules later.
> >>>>>>
> >>>>>> As for the main artifacts:
> >>>>>>
> >>>>>>       - *phoenix-server.jar* would include code from all three
> modules.
> >>>>>>       - A newly added *phoenix-client-byo-shaded-hbase.jar *would
> include
> >>>>> only
> >>>>>>       the code from cleaned-up phoenix-core
> >>>>>>       - Ideally, we'd remove the tools and coprocessor code (and
> >>>>>>       dependencies) from the standard and embedded clients, and
> switch
> >>>>>>       documentation to use *phoenix-server* to run the MR tools,
> but this
> >>>>> is
> >>>>>>       optional.
> >>>>>>
> >>>>>> I am tracking this work in PHOENIX-6053, which has a (currently
> >>>> working)
> >>>>>> WIP patch attached.
> >>>>>>
> >>>>>> I think that this change would fit the pattern established by
> creating
> >>>>> the
> >>>>>> phoenix-tools module,
> >>>>>> but as this is a major change in project structure (even if the actual
> >>>> Java
> >>>>>> changes are trivial),
> >>>>>> I'd like to gather your input on this approach (please also speak
> up if
> >>>>> you
> >>>>>> agree).
> >>>>>>
> >>>>>> regards
> >>>>>> Istvan
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >>
> >> --
> >> István Tóth  | Staff Software Engineer
> >>
> >> st...@cloudera.com
> >>
> >>
> >> ________________________________
> >>
> >
>
