Re: Issues with Beam SQL on Spark

Andrew Pilloud Tue, 24 Jul 2018 09:57:13 -0700

Looks like calcite isn't easily repackageable. This issue can be fixed
either in our shading (by also rewriting the "jdbc:calcite:" string when we
shade calcite) or in calcite (by not using the driver manager to connect
between calcite modules).


Andrew

On Mon, Jul 23, 2018 at 11:18 PM Kai Jiang <jiang...@gmail.com> wrote:

> Hi all,
>
> I met an issue when I ran Beam SQL on Spark. I want to check and see if
> anyone has same issue with me. I believe let beam sql running on spark is
> important. If you encountered same problem, it will be really helpful if
> you could give some inputs.
>
> Context:
> I setup TPC framework to run sql on spark. Code
> <https://github.com/vectorijk/beam/blob/tpch/sdks/java/extensions/tpc/src/main/java/org/apache/beam/sdk/extensions/tpc/BeamTpc.java>
> is simple which just ingests csv data and apply Sql on that. Gradle
> <https://github.com/vectorijk/beam/blob/tpch/sdks/java/extensions/tpc/build.gradle>
>  setting
> includes `runner-spark` and necessary libraries.  Exception Stack trace
> <https://gist.github.com/vectorijk/849cbcd5bce558e5e7c97916ca4c793a> shows
> some details. However, same code can running on Flink and Dataflow
> successfully.
>
> Investigations:
> BEAM-3386 <https://issues.apache.org/jira/browse/BEAM-3386> also
> describes the similar issue I have. It took me some time on investigating
> it. I guess there should be a version conflict between Calcite library in
> Spark and Beam SQL repackaged Calcite. The version of Calcite library Spark
> ( * - 2.3.1) used is very old (1.2.0-incubating).
>
> After packaging fat jar and submitting it to Spark, Spark registered both
> old version's calcite jdbc driver and Beam's repackaged jdbc driver in
>
> registeredDrivers(DriverManager.java#L294 
> <https://github.com/JetBrains/jdk8u_jdk/blob/master/src/share/classes/java/sql/DriverManager.java#L294>).
>  Jdbc's DriverManager always connects to old version calcite's jdbc in spark 
> instead of beam's repackaged calcite.
>
>
> Looking into Line DriverManager.java#L556 
> <https://github.com/JetBrains/jdk8u_jdk/blob/master/src/share/classes/java/sql/DriverManager.java#L556>
>  and insert a breakpoint, aClass =  
> Class.forName(driver.getClass().getName(), true, classLoader);
>
> driver.getClass().getName() -> "org.apache.calcite.jdbc.Driver"
> classLoader only has class 'org.apache.beam.**' and
> 'org.apache.beam.repackaged.beam_***'. (There is no path of class
> 'org.apache.calcite.*')
>
> Oddly, aClass is assigned with Class "org.apache.calcite.jdbc.Driver". I
> think it should raise an exception and be skipped. Actually, It did not.  So
> this spark's calcite jdbc driver has been connected. All logic afterwards
> goes to spark's calcite classpath. I believe that's pivot point.
>
> Potentially solutions:
> *1.* Figure out why DriverManager.java#L556
> <https://github.com/JetBrains/jdk8u_jdk/blob/master/src/share/classes/java/sql/DriverManager.java#L556>
>  does
> not throw exception.
>
> I guess it is the best option.
>
> 2. Upgrade Spark' calcite.
>
> It is not a good option because old calcite version affects many spark
> versions.
>
> 3. Not using repackage for calcite library.
>
> I tried. I built fat jar with non-repackaged calcite. But, Spark is still
> using its own calcite.
>
> Plus, I am curious if there is any specific reason we need to use
> repackage strategy for Calcite. @Mingmin Xu <mingm...@gmail.com>
>
>
> Thanks for reading!
>
> Best,
> Kai
> ᐧ
>

Re: Issues with Beam SQL on Spark

Reply via email to