Re: Issues with Beam SQL on Spark

Kai Jiang Sun, 29 Jul 2018 21:07:52 -0700

Hi Andrew,

I tried replacing "jdbc:calcite" with "jdbc:beam" in Calcite and
re-shading. After that, Beam SQL runs on Spark.
However, I didn't find a way to modify the code while shading the Calcite
library. I think the second method you mentioned is feasible.
I'll forward this thread to dev@calcite to see if we can connect
Calcite modules without using the DriverManager.


Best,
Kai

On Tue, Jul 24, 2018 at 1:04 PM Kai Jiang <jiang...@gmail.com> wrote:

> Thank you, Andrew! I will check whether it is feasible to rewrite
> "jdbc:calcite:" in Beam's repackaged calcite.
>
> Best,
> Kai
>
> On 2018/07/24 19:08:17, Andrew Pilloud <apill...@google.com> wrote:
> > I don't really think this is something that involves changes to
> > DriverManager. Beam is causing the problem by relocating calcite's path
> > but not also modifying the global state it creates.
> >
> > Andrew
> >
> > On Tue, Jul 24, 2018 at 12:03 PM Kai Jiang <jiang...@gmail.com> wrote:
> >
> > > Thanks, Andrew! That's really helpful. I'll try shading calcite while
> > > rewriting the "jdbc:calcite" string.
> > > I also had a look at the DriverManager docs. Do you think listing all
> > > repackaged JDBC drivers in a property setting like the one below would help?
> > >  jdbc.drivers=org.apache.beam.repackaged.beam.
> > >
> > > Best,
> > > Kai
> > >
> > > On 2018/07/24 16:56:50, Andrew Pilloud <apill...@google.com> wrote:
> > > > Looks like calcite isn't easily repackageable. This issue can be fixed
> > > > either in our shading (by also rewriting the "jdbc:calcite:" string when we
> > > > shade calcite) or in calcite (by not using the driver manager to connect
> > > > between calcite modules).
> > > >
> > > > Andrew
> > > >
> > > > On Mon, Jul 23, 2018 at 11:18 PM Kai Jiang <jiang...@gmail.com> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > > I ran into an issue when running Beam SQL on Spark, and I want to see whether
> > > > > > anyone else has hit the same problem. I believe getting Beam SQL to run on
> > > > > > Spark is important. If you have encountered the same problem, any input
> > > > > > would be really helpful.
> > > > >
> > > > > Context:
> > > > > > I set up a TPC framework to run SQL on Spark. The code
> > > > > > <https://github.com/vectorijk/beam/blob/tpch/sdks/java/extensions/tpc/src/main/java/org/apache/beam/sdk/extensions/tpc/BeamTpc.java>
> > > > > > is simple: it just ingests CSV data and applies SQL to it. The Gradle
> > > > > > <https://github.com/vectorijk/beam/blob/tpch/sdks/java/extensions/tpc/build.gradle>
> > > > > > setting includes `runner-spark` and the necessary libraries. The exception stack trace
> > > > > > <https://gist.github.com/vectorijk/849cbcd5bce558e5e7c97916ca4c793a>
> > > > > > shows some details. However, the same code runs successfully on Flink and
> > > > > > Dataflow.
> > > > >
> > > > > Investigations:
> > > > > > BEAM-3386 <https://issues.apache.org/jira/browse/BEAM-3386> describes
> > > > > > a similar issue. I spent some time investigating it. I suspect there is a
> > > > > > version conflict between the Calcite library bundled with Spark and Beam SQL's
> > > > > > repackaged Calcite. The Calcite version that Spark (versions up to 2.3.1)
> > > > > > uses is very old (1.2.0-incubating).
> > > > >
> > > > > > After I packaged a fat jar and submitted it to Spark, Spark registered both
> > > > > > the old Calcite JDBC driver and Beam's repackaged JDBC driver in
> > > > > > registeredDrivers (DriverManager.java#L294
> > > > > > <https://github.com/JetBrains/jdk8u_jdk/blob/master/src/share/classes/java/sql/DriverManager.java#L294>).
> > > > > > JDBC's DriverManager always connects to the old Calcite JDBC driver in Spark
> > > > > > instead of Beam's repackaged Calcite.
> > > > >
> > > > >
> > > > > > I set a breakpoint at DriverManager.java#L556
> > > > > > <https://github.com/JetBrains/jdk8u_jdk/blob/master/src/share/classes/java/sql/DriverManager.java#L556>,
> > > > > > which reads:
> > > > > > aClass = Class.forName(driver.getClass().getName(), true, classLoader);
> > > > >
> > > > > driver.getClass().getName() -> "org.apache.calcite.jdbc.Driver"
> > > > > > classLoader only has classes under 'org.apache.beam.**' and
> > > > > > 'org.apache.beam.repackaged.beam_***'. (There is no class path entry for
> > > > > > 'org.apache.calcite.*'.)
> > > > >
> > > > > > Oddly, aClass is assigned the class "org.apache.calcite.jdbc.Driver". I
> > > > > > think it should have raised an exception and been skipped, but it did
> > > > > > not. So Spark's Calcite JDBC driver is the one that gets connected, and all
> > > > > > subsequent logic runs against Spark's Calcite classpath. I believe that is
> > > > > > the pivotal point.
> > > > >
> > > > > > Potential solutions:
> > > > > > 1. Figure out why DriverManager.java#L556
> > > > > > <https://github.com/JetBrains/jdk8u_jdk/blob/master/src/share/classes/java/sql/DriverManager.java#L556>
> > > > > > does not throw an exception.
> > > > > >
> > > > > > I think this is the best option.
> > > > >
> > > > > > 2. Upgrade Spark's Calcite.
> > > > > >
> > > > > > This is not a good option because the old Calcite version affects many
> > > > > > Spark versions.
> > > > >
> > > > > > 3. Do not repackage the Calcite library.
> > > > > >
> > > > > > I tried this: I built a fat jar with non-repackaged Calcite, but Spark
> > > > > > still uses its own Calcite.
> > > > >
> > > > > > Also, I am curious whether there is a specific reason we need to
> > > > > > repackage Calcite. @Mingmin Xu <mingm...@gmail.com>
> > > > >
> > > > >
> > > > > Thanks for reading!
> > > > >
> > > > > Best,
> > > > > Kai
> > > > >
> > > >
> > >
> >
>
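
On the question in the original message of why the Class.forName check at DriverManager.java#L556 does not throw: one possible explanation, not confirmed in this thread, is standard parent delegation. A class loader whose own entries contain only Beam classes can still resolve a class that one of its parent loaders can see. The sketch below shows only that general JVM behaviour. DelegationDemo, visibleThrough and emptyChildOf are hypothetical names for this example, and it does not reproduce Spark's actual class loader setup.

```java
import java.net.URL;
import java.net.URLClassLoader;

class DelegationDemo {
    /** Returns true if the named class can be resolved through the given loader. */
    static boolean visibleThrough(String className, ClassLoader loader) {
        try {
            // Same call shape as the check quoted from DriverManager.
            Class.forName(className, true, loader);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    /** Creates a loader with no entries of its own; it can only delegate to its parent. */
    static ClassLoader emptyChildOf(ClassLoader parent) {
        return new URLClassLoader(new URL[0], parent);
    }

    public static void main(String[] args) {
        String name = DelegationDemo.class.getName();

        // The child has no URLs, but its parent can see this class.
        ClassLoader withParent = emptyChildOf(DelegationDemo.class.getClassLoader());
        System.out.println(visibleThrough(name, withParent));

        // With a null parent only the bootstrap loader is consulted, so lookup fails.
        ClassLoader isolated = emptyChildOf(null);
        System.out.println(visibleThrough(name, isolated));
    }
}
```

If Spark's Calcite jar is visible to a parent of the loader passed to Class.forName, the lookup would succeed in the same way even though the loader's own entries contain no 'org.apache.calcite.*' classes.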
