Thanks for sharing those issues. It does seem related to me, based on the similar test-case failures they saw internally. I could try dropping the iceberg-runtime jar into Spark's jars directory to see if that avoids this, since the classloading issue seems to come up when the jar is loaded via the --jars argument with spark-connect.
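The class-identity rule Eduard explains below (a class defined by two different loaders is two distinct types to the JVM, even when built from identical bytes) can be reproduced outside of Spark in a few lines. A minimal sketch; the `ChildFirstLoader` here is a toy stand-in I wrote for this note, not Spark's `ChildFirstURLClassLoader`:

```java
import java.io.InputStream;

public class LoaderDemo {
    public static class Payload {}

    // Toy "child first" loader: it defines Payload from the class-file bytes
    // itself instead of delegating to the parent, the way a child-first
    // classloader can end up doing.
    static class ChildFirstLoader extends ClassLoader {
        ChildFirstLoader(ClassLoader parent) { super(parent); }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (name.equals(Payload.class.getName())) {
                try (InputStream in = getResourceAsStream(name.replace('.', '/') + ".class")) {
                    byte[] bytes = in.readAllBytes();
                    return defineClass(name, bytes, 0, bytes.length);
                } catch (Exception e) {
                    throw new ClassNotFoundException(name, e);
                }
            }
            return super.loadClass(name, resolve);
        }
    }

    public static void main(String[] args) throws Exception {
        Class<?> fromApp = Payload.class;
        Class<?> fromChild =
                new ChildFirstLoader(LoaderDemo.class.getClassLoader()).loadClass(fromApp.getName());

        // Same name, same bytes, different defining loaders => distinct runtime types.
        System.out.println(fromApp.getName().equals(fromChild.getName())); // true
        System.out.println(fromApp == fromChild);                          // false

        Object o = fromChild.getDeclaredConstructor().newInstance();
        System.out.println(fromApp.isInstance(o)); // false
    }
}
```

Casting the reflectively created instance to the application-loaded `Payload` would throw exactly the kind of ClassCastException seen in the stack traces below.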
On Fri, Feb 23, 2024 at 1:39 AM Eduard Tudenhoefner <edu...@tabular.io> wrote:

> I wonder if this is somewhat related to
> https://github.com/apache/spark/commit/6d0fed9a18ff87e73fdf1ee46b6b0d2df8dd5a1b
> / SPARK-43744 <https://issues.apache.org/jira/browse/SPARK-43744>, which
> appears to have fixed similar issues to the ones you were experiencing for
> Spark 3.5, but maybe some other place that needed the same fix was missed.
>
> The thing in Java is that if a class is loaded by two different class
> loaders, the two resulting classes are never considered equal. That means a
> ClassCastException like the one you mentioned in your first email can
> happen for exactly that reason.
>
> @Nirav, were you able to test Spark 3.4 vs 3.5 with Iceberg 1.4.x? In your
> previous email you only mentioned Spark 3.5 + Iceberg 1.4.x vs Spark 3.4 +
> Iceberg 1.3.x, but I would only compare different Spark versions while
> keeping the Iceberg version the same.
>
> To answer your question about whether it is an Iceberg or a Spark issue, I
> think it's a spark-connect issue with different classloaders. I've seen
> similar things in the past in other Java environments, and Iceberg itself
> doesn't do anything fancy around classloading.
>
> On Thu, Feb 22, 2024 at 11:15 PM Nirav Patel <nira...@gmail.com> wrote:
>
>> Hi Ryan,
>>
>> I updated the Spark JIRA I opened with more information I found after
>> taking a heap dump:
>>
>> https://issues.apache.org/jira/browse/SPARK-46762
>>
>> The class `org.apache.iceberg.Table` is loaded twice: once by
>> ChildFirstURLClassLoader and once by MutableURLClassLoader.
>>
>> The issue doesn't happen with Spark 3.4 and Iceberg 1.3, as I mentioned
>> in the ticket. Do you think it's still a spark-connect issue? I noticed
>> there is a somewhat larger set of migration changes in the Iceberg repo
>> going from 1.3 to 1.4 in order to support Spark 3.5. Do you think
>> something might have been missed there?
>>
>> Thanks
>> Nirav
>>
>> On Thu, Jan 18, 2024 at 9:46 AM Nirav Patel <nira...@gmail.com> wrote:
>>
>>> Classloading does seem like an issue, though only when using Spark
>>> Connect 3.5 with Iceberg >= 1.4.
>>>
>>> It's odd because, as I also mentioned in a previous email, after adding
>>> the Spark property (spark.executor.userClassPathFirst=true) both classes
>>> get loaded by the same classloader,
>>> org.apache.spark.util.ChildFirstURLClassLoader. Not sure why the error
>>> would still happen.
>>>
>>> java.lang.ClassCastException: class
>>> org.apache.iceberg.spark.source.SerializableTableWithSize cannot be cast
>>> to class org.apache.iceberg.Table (org.apache.iceberg.spark.source.
>>> *SerializableTableWithSize* is in unnamed module of loader
>>> org.apache.spark.util.*ChildFirstURLClassLoader* @a41c33c;
>>> org.apache.iceberg.*Table* is in unnamed module of loader
>>> org.apache.spark.util.*ChildFirstURLClassLoader* @16f95afb)
>>>
>>> On Tue, Jan 16, 2024 at 12:53 PM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> It looks to me like the classloader is the problem. The "child first"
>>>> classloader is apparently loading `Table`, but Spark is loading
>>>> `SerializableTableWithSize` from the parent classloader. Because
>>>> delegation isn't happening properly, you're getting two incompatible
>>>> classes from the same classpath, depending on where a class was loaded
>>>> for the first time.
>>>>
>>>> On Fri, Jan 12, 2024 at 5:30 PM Nirav Patel <nira...@gmail.com> wrote:
>>>>
>>>>> It seems to be happening on an executor of the SC server, since I see
>>>>> the error in the executor logs. We did verify that only one version of
>>>>> iceberg-spark-runtime was present.
>>>>> We do include a custom catalog implementation jar. Although it's a
>>>>> shaded jar, I don't see "org/apache/iceberg/Table" or other Iceberg
>>>>> classes in it when I run "jar -tvf" on it.
>>>>>
>>>>> I see both jars in three Spark configs:
>>>>> spark.repl.local.jars, spark.yarn.dist.jars, and
>>>>> spark.yarn.secondary.jars.
>>>>>
>>>>> I suspected a classloading issue as well, since the initial error
>>>>> pointed to it:
>>>>>
>>>>> pyspark.errors.exceptions.connect.SparkConnectGrpcException:
>>>>> (org.apache.spark.SparkException) Job aborted due to stage failure:
>>>>> Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3
>>>>> in stage 0.0 (TID 3) (spark35-m.c.strivr-dev-test.internal executor 2):
>>>>> java.lang.ClassCastException: class
>>>>> org.apache.iceberg.spark.source.SerializableTableWithSize cannot be
>>>>> cast to class org.apache.iceberg.Table
>>>>> (org.apache.iceberg.spark.source.SerializableTableWithSize is in
>>>>> unnamed module of loader org.apache.spark.util.*MutableURLClassLoader*
>>>>> @6819e13c; org.apache.iceberg.Table is in unnamed module of loader
>>>>> org.apache.spark.util.*ChildFirstURLClassLoader* @15fb0c43)
>>>>>
>>>>> Although *ChildFirstURLClassLoader* is a child of MutableURLClassLoader,
>>>>> the error shouldn't be related to that. I still tried adding the Spark
>>>>> flag (--conf "spark.executor.userClassPathFirst=true") when starting
>>>>> the Spark Connect server.
>>>>> With that, both classes seem to get loaded by the same ClassLoader,
>>>>> but the error still happens:
>>>>>
>>>>> pyspark.errors.exceptions.connect.SparkConnectGrpcException:
>>>>> (org.apache.spark.SparkException) Job aborted due to stage failure:
>>>>> Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3
>>>>> in stage 0.0 (TID 3) (spark35-m.c.strivr-dev-test.internal executor 2):
>>>>> java.lang.ClassCastException: class
>>>>> org.apache.iceberg.spark.source.SerializableTableWithSize cannot be
>>>>> cast to class org.apache.iceberg.Table
>>>>> (org.apache.iceberg.spark.source.SerializableTableWithSize is in
>>>>> unnamed module of loader org.apache.spark.util.*ChildFirstURLClassLoader*
>>>>> @a41c33c; org.apache.iceberg.Table is in unnamed module of loader
>>>>> org.apache.spark.util.*ChildFirstURLClassLoader* @16f95afb)
>>>>>
>>>>> I see "ClassLoader @<some_id>" in the logs. Are those object IDs? (It's
>>>>> been a while since I worked with Java.) I'm wondering whether multiple
>>>>> instances of the same ClassLoader are being initialized by SC. Maybe
>>>>> running with -verbose:class or taking a heap dump would help verify?
>>>>>
>>>>> On Fri, Jan 12, 2024 at 4:38 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>
>>>>>> I think it looks like a version mismatch, perhaps between the SC
>>>>>> client and the server, or between where planning occurs and the
>>>>>> executors. The error is that `SerializableTableWithSize` is not a
>>>>>> subclass of `Table`, but it definitely should be. That sort of
>>>>>> problem is usually caused by classloading issues. Can you
>>>>>> double-check that you have only one Iceberg runtime in the
>>>>>> Environment tab of your Spark cluster?
>>>>>>
>>>>>> On Tue, Jan 9, 2024 at 4:57 PM Nirav Patel <nira...@gmail.com> wrote:
>>>>>>
>>>>>>> PS - the issue doesn't happen if we don't use spark-connect and
>>>>>>> instead just use spark-shell or pyspark, as the OP on GitHub noted
>>>>>>> as well.
>>>>>>> However, the stack trace doesn't seem to point to any class from
>>>>>>> the spark-connect jar (org.apache.spark:spark-connect_2.12:3.5.0).
>>>>>>>
>>>>>>> On Tue, Jan 9, 2024 at 4:52 PM Nirav Patel <nira...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> We are testing spark-connect with Iceberg.
>>>>>>>> We tried Spark 3.5 with the Iceberg 1.4.x versions (all of the
>>>>>>>> iceberg-spark-runtime-3.5_2.12-1.4.x.jar releases).
>>>>>>>>
>>>>>>>> With all of the 1.4.x jars we hit the following issue when running
>>>>>>>> Iceberg queries from a SparkSession created using spark-connect
>>>>>>>> (--remote "sc://remote-master-node"):
>>>>>>>>
>>>>>>>> org.apache.iceberg.spark.source.SerializableTableWithSize cannot be cast to org.apache.iceberg.Table
>>>>>>>>   at org.apache.iceberg.spark.source.SparkInputPartition.table(SparkInputPartition.java:88)
>>>>>>>>   at org.apache.iceberg.spark.source.BatchDataReader.<init>(BatchDataReader.java:50)
>>>>>>>>   at org.apache.iceberg.spark.source.SparkColumnarReaderFactory.createColumnarReader(SparkColumnarReaderFactory.java:52)
>>>>>>>>   at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:79)
>>>>>>>>   at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
>>>>>>>>   at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>>>>>>>>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>>>>>>>>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
>>>>>>>>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithKeys_0$(Unknown Source)
>>>>>>>>   at ...
>>>>>>>>
>>>>>>>> Someone else has reported this issue on GitHub as well:
>>>>>>>> https://github.com/apache/iceberg/issues/8978
>>>>>>>>
>>>>>>>> It's currently working with Spark 3.4 and Iceberg 1.3. However, it
>>>>>>>> would ideally be nice to get it working with Spark 3.5 as well,
>>>>>>>> since 3.5 has many improvements in spark-connect.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Nirav
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>
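On Nirav's question earlier in the thread about the `@a41c33c`-style suffixes: in the JVM's detailed ClassCastException message, that hex value is, to my understanding, the identity hash code of the loader instance, so two different suffixes for the same loader class do indicate two distinct ChildFirstURLClassLoader instances. A small hypothetical helper (the names are mine, not from any of the projects discussed) that prints the same loader identity for any class:

```java
public class LoaderDiag {
    // Render a class the way the JVM's ClassCastException message does:
    // "<class> loaded by <loaderClass>@<identity hash in hex>".
    // Classes from the bootstrap loader (e.g. java.lang.String) report "bootstrap".
    public static String describe(Class<?> c) {
        ClassLoader l = c.getClassLoader();
        if (l == null) {
            return c.getName() + " loaded by bootstrap";
        }
        return c.getName() + " loaded by "
                + l.getClass().getName() + "@" + Integer.toHexString(System.identityHashCode(l));
    }

    public static void main(String[] args) {
        System.out.println(describe(String.class));
        System.out.println(describe(LoaderDiag.class));
    }
}
```

Logging something like `describe(org.apache.iceberg.Table.class)` from both driver and executor code would show whether the class was defined by one loader instance or several.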
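Ryan's suggestion to confirm there is only one Iceberg runtime on the classpath, and the `jar -tvf` check on the shaded catalog jar, can also be scripted. A sketch (class and method names are illustrative, not from any of the projects discussed) that lists jar entries under a package prefix; an empty result means the jar does not bundle classes from that package:

```java
import java.io.IOException;
import java.util.List;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

public class JarScan {
    // Return jar entry names starting with the given prefix,
    // e.g. "org/apache/iceberg/".
    public static List<String> entriesUnder(String jarPath, String prefix) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.stream()
                    .map(e -> e.getName())
                    .filter(n -> n.startsWith(prefix))
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java JarScan my-shaded-catalog.jar org/apache/iceberg/
        entriesUnder(args[0], args[1]).forEach(System.out::println);
    }
}
```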