I wonder if this is related to
https://github.com/apache/spark/commit/6d0fed9a18ff87e73fdf1ee46b6b0d2df8dd5a1b /
SPARK-43744 <https://issues.apache.org/jira/browse/SPARK-43744>, which
appears to have fixed similar issues to the one you're experiencing, for
Spark 3.5; perhaps another place that needs the same fix was missed.
In Java, if the same class is loaded by two different class loaders, the
two resulting Class objects are never considered equal: the JVM identifies
a class by its name plus its defining loader. That alone can produce a
ClassCastException like the one you mentioned in your first email.
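
To illustrate, here is a minimal, self-contained Java sketch (the jar path
is hypothetical) showing that two loaders defining the same class yield
Class objects the JVM treats as unrelated types:

import java.net.URL;
import java.net.URLClassLoader;

public class DuplicateClassDemo {
  public static void main(String[] args) throws Exception {
    // Hypothetical jar that contains org.apache.iceberg.Table
    URL[] cp = { new URL("file:/tmp/iceberg-api.jar") };

    // parent = null, so neither loader delegates this class elsewhere
    try (URLClassLoader a = new URLClassLoader(cp, null);
         URLClassLoader b = new URLClassLoader(cp, null)) {
      Class<?> t1 = a.loadClass("org.apache.iceberg.Table");
      Class<?> t2 = b.loadClass("org.apache.iceberg.Table");

      System.out.println(t1.getName().equals(t2.getName())); // true
      System.out.println(t1 == t2);                          // false
      System.out.println(t1.isAssignableFrom(t2));           // false
      // Casting an instance of one to the other throws
      // ClassCastException, exactly like the errors below.
    }
  }
}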

@Nirav were you able to test Spark 3.4 vs 3.5 with Iceberg 1.4.x? In your
previous email you only mentioned Spark 3.5 + Iceberg 1.4.x vs Spark 3.4 +
Iceberg 1.3.x, but I would compare different Spark versions while keeping
the Iceberg version the same.

To answer your question about whether it's an Iceberg or a Spark issue: I
think it's a Spark Connect issue with different classloaders. I've seen
similar things in other Java environments, and Iceberg itself doesn't do
anything fancy around classloading.


On Thu, Feb 22, 2024 at 11:15 PM Nirav Patel <nira...@gmail.com> wrote:

> Hi Ryan,
>
> I updated the spark-jira I opened with more information I found after
> taking heapdump:
>
> https://issues.apache.org/jira/browse/SPARK-46762
>
> The class `org.apache.iceberg.Table` is loaded twice: once by
> ChildFirstURLClassLoader and once by MutableURLClassLoader.
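>
> For context, here is a rough sketch (not actual code from the job) of how
> the duplicate load shows up at the failing cast; "obj" stands for the
> deserialized SerializableTableWithSize instance:
>
>   // Compare the defining loaders on both sides of the cast.
>   static void dumpLoaders(Object obj) {
>     Class<?> expected = org.apache.iceberg.Table.class; // as seen by this code
>     Class<?> actual = obj.getClass();                   // as deserialized
>     System.out.println("expected loader: " + expected.getClassLoader());
>     System.out.println("actual loader:   " + actual.getClassLoader());
>     // false when the two sides come from different loaders:
>     System.out.println(expected.isAssignableFrom(actual));
>   }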
>
> The issue doesn't happen with Spark 3.4 and Iceberg 1.3, as I mentioned in
> the ticket. Do you think it's still a spark-connect issue? I noticed there
> are fairly large migration changes in the Iceberg repo going from 1.3 to
> 1.4 in order to support Spark 3.5. Do you think something might have been
> missed there?
>
>
> Thanks
> Nirav
>
> On Thu, Jan 18, 2024 at 9:46 AM Nirav Patel <nira...@gmail.com> wrote:
>
>> Classloading does seem to be the issue, though it only shows up with
>> Spark Connect 3.5 and Iceberg >= 1.4.
>>
>> It's weird: as I mentioned in my previous email, after adding the Spark
>> property (spark.executor.userClassPathFirst=true), both classes get loaded
>> from the same classloader, org.apache.spark.util.ChildFirstURLClassLoader.
>> Not sure why the error would still happen.
>>
>> java.lang.ClassCastException: class
>> org.apache.iceberg.spark.source.SerializableTableWithSize cannot be cast to
>> class org.apache.iceberg.Table
>> (org.apache.iceberg.spark.source.SerializableTableWithSize is in unnamed
>> module of loader org.apache.spark.util.ChildFirstURLClassLoader @a41c33c;
>> org.apache.iceberg.Table is in unnamed module of loader
>> org.apache.spark.util.ChildFirstURLClassLoader @16f95afb)
>>
>>
>> On Tue, Jan 16, 2024 at 12:53 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> It looks to me like the classloader is the problem. The "child first"
>>> classloader is apparently loading `Table`, but Spark is loading
>>> `SerializableTableWithSize` from the parent classloader. Because delegation
>>> isn't happening properly, you're getting two incompatible classes from the
>>> same classpath, depending on where a class was loaded for the first time.
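>>>
>>> For illustration, a simplified child-first loader (not Spark's exact
>>> implementation) makes the first-touch behavior concrete: lookup tries the
>>> child's own URLs before delegating to the parent, so the defining loader
>>> depends on which code path touches the class first.
>>>
>>> import java.net.URL;
>>> import java.net.URLClassLoader;
>>>
>>> class ChildFirstLoader extends URLClassLoader {
>>>   ChildFirstLoader(URL[] urls, ClassLoader parent) {
>>>     super(urls, parent);
>>>   }
>>>
>>>   @Override
>>>   protected Class<?> loadClass(String name, boolean resolve)
>>>       throws ClassNotFoundException {
>>>     synchronized (getClassLoadingLock(name)) {
>>>       Class<?> c = findLoadedClass(name);
>>>       if (c == null) {
>>>         try {
>>>           c = findClass(name);              // own URLs first...
>>>         } catch (ClassNotFoundException e) {
>>>           c = super.loadClass(name, false); // ...then the parent
>>>         }
>>>       }
>>>       if (resolve) {
>>>         resolveClass(c);
>>>       }
>>>       return c;
>>>     }
>>>   }
>>> }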
>>>
>>> On Fri, Jan 12, 2024 at 5:30 PM Nirav Patel <nira...@gmail.com> wrote:
>>>
>>>> It seems to be happening on the executors of the SC server, as I see the
>>>> error in the executor logs. We verified that there is only one version of
>>>> iceberg-spark-runtime present. We do include a custom catalog
>>>> implementation jar; though it's a shaded jar, I don't see
>>>> org/apache/iceberg/Table or other Iceberg classes when I run "jar -tvf"
>>>> on it.
>>>>
>>>> I see both jars in three Spark configs: spark.repl.local.jars,
>>>> spark.yarn.dist.jars, and spark.yarn.secondary.jars.
>>>>
>>>> I suspected a classloading issue as well, since the initial error
>>>> pointed to it:
>>>>
>>>> pyspark.errors.exceptions.connect.SparkConnectGrpcException:
>>>> (org.apache.spark.SparkException) Job aborted due to stage failure: Task 0
>>>> in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
>>>> 0.0 (TID 3) (spark35-m.c.strivr-dev-test.internal executor 2):
>>>> java.lang.ClassCastException: class
>>>> org.apache.iceberg.spark.source.SerializableTableWithSize cannot be cast to
>>>> class org.apache.iceberg.Table
>>>> (org.apache.iceberg.spark.source.SerializableTableWithSize is in unnamed
>>>> module of loader org.apache.spark.util.MutableURLClassLoader
>>>> @6819e13c; org.apache.iceberg.Table is in unnamed module of loader
>>>> org.apache.spark.util.ChildFirstURLClassLoader @15fb0c43)
>>>>
>>>> Although ChildFirstURLClassLoader is a child of MutableURLClassLoader,
>>>> the error shouldn't be related to that. I still tried adding the Spark
>>>> flag (--conf "spark.executor.userClassPathFirst=true") when starting the
>>>> Spark Connect server. It seems both classes get loaded by the same
>>>> ClassLoader, but the error still happens:
>>>>
>>>> pyspark.errors.exceptions.connect.SparkConnectGrpcException:
>>>> (org.apache.spark.SparkException) Job aborted due to stage failure: Task 0
>>>> in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
>>>> 0.0 (TID 3) (spark35-m.c.strivr-dev-test.internal executor 2):
>>>> java.lang.ClassCastException: class
>>>> org.apache.iceberg.spark.source.SerializableTableWithSize cannot be cast to
>>>> class org.apache.iceberg.Table
>>>> (org.apache.iceberg.spark.source.SerializableTableWithSize is in unnamed
>>>> module of loader org.apache.spark.util.ChildFirstURLClassLoader
>>>> @a41c33c; org.apache.iceberg.Table is in unnamed module of loader
>>>> org.apache.spark.util.ChildFirstURLClassLoader @16f95afb)
>>>>
>>>> I see ClassLoader @<some_id> in the logs. Are those object IDs? (It's
>>>> been a while since I worked with Java.) I'm wondering if multiple
>>>> instances of the same ClassLoader are being initialized by SC. Maybe
>>>> running with -verbose:class or taking a heap dump would help verify?
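>>>>
>>>> To make the question concrete, a small sketch: if the hex suffix is an
>>>> identity hash, running this on the executor should print a matching
>>>> value:
>>>>
>>>>   ClassLoader cl = org.apache.iceberg.Table.class.getClassLoader();
>>>>   // Should match the "@a41c33c"-style suffix from the exception message
>>>>   System.out.println(cl.getClass().getName() + " @"
>>>>       + Integer.toHexString(System.identityHashCode(cl)));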
>>>>
>>>>
>>>> On Fri, Jan 12, 2024 at 4:38 PM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> I think it looks like a version mismatch, perhaps between the SC
>>>>> client and the server or between where planning occurs and the executors.
>>>>> The error is that the `SerializableTableWithSize` is not a subclass of
>>>>> `Table`, but it definitely should be. That sort of problem is usually
>>>>> caused by class loading issues. Can you double-check that you have only 
>>>>> one
>>>>> Iceberg runtime in the Environment tab of your Spark cluster?
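>>>>>
>>>>> If the UI isn't handy, here's a rough sketch (assuming the loaders are
>>>>> URLClassLoader-based, which Spark's are) to list Iceberg jars visible
>>>>> from the executor's loader chain:
>>>>>
>>>>>   ClassLoader cl = Thread.currentThread().getContextClassLoader();
>>>>>   while (cl != null) {
>>>>>     if (cl instanceof java.net.URLClassLoader) {
>>>>>       for (java.net.URL u : ((java.net.URLClassLoader) cl).getURLs()) {
>>>>>         if (u.toString().contains("iceberg")) {
>>>>>           System.out.println(cl + " -> " + u);
>>>>>         }
>>>>>       }
>>>>>     }
>>>>>     cl = cl.getParent();
>>>>>   }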
>>>>>
>>>>> On Tue, Jan 9, 2024 at 4:57 PM Nirav Patel <nira...@gmail.com> wrote:
>>>>>
>>>>>> PS - the issue doesn't happen if we don't use spark-connect and instead
>>>>>> just use spark-shell or pyspark, as the OP on GitHub said as well.
>>>>>> However, the stacktrace doesn't seem to point to any class from the
>>>>>> spark-connect jar (org.apache.spark:spark-connect_2.12:3.5.0).
>>>>>>
>>>>>> On Tue, Jan 9, 2024 at 4:52 PM Nirav Patel <nira...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> We are testing spark-connect with Iceberg.
>>>>>>> We tried Spark 3.5 with Iceberg 1.4.x (all of the
>>>>>>> iceberg-spark-runtime-3.5_2.12-1.4.x.jar releases).
>>>>>>>
>>>>>>> With all of the 1.4.x jars we hit the following issue when running
>>>>>>> Iceberg queries from a SparkSession created using spark-connect
>>>>>>> (--remote "sc://remote-master-node"):
>>>>>>>
>>>>>>> org.apache.iceberg.spark.source.SerializableTableWithSize cannot be
>>>>>>> cast to org.apache.iceberg.Table
>>>>>>>   at org.apache.iceberg.spark.source.SparkInputPartition.table(SparkInputPartition.java:88)
>>>>>>>   at org.apache.iceberg.spark.source.BatchDataReader.<init>(BatchDataReader.java:50)
>>>>>>>   at org.apache.iceberg.spark.source.SparkColumnarReaderFactory.createColumnarReader(SparkColumnarReaderFactory.java:52)
>>>>>>>   at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:79)
>>>>>>>   at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
>>>>>>>   at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>>>>>>>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>>>>>>>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
>>>>>>>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithKeys_0$(Unknown Source)
>>>>>>>   at
>>>>>>>
>>>>>>> Someone else has reported this issue on github as well:
>>>>>>> https://github.com/apache/iceberg/issues/8978
>>>>>>>
>>>>>>> It currently works with Spark 3.4 and Iceberg 1.3. However, it would
>>>>>>> be nice to get it working with Spark 3.5 as well, since 3.5 has many
>>>>>>> improvements in spark-connect.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Nirav
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
