[ https://issues.apache.org/jira/browse/SPARK-50476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17903798#comment-17903798 ]
Mykola Melnyk commented on SPARK-50476:
---------------------------------------

Hello [~Gengliang.Wang]

Thank you for the response.

Yes, I understand that _PartitionedFile_ is an internal Spark class, but I did not find another way to handle manual partitioning at the data source level. It was the one way I found to implement lazy reading of pages from a PDF document.

Or do you have other ideas on how to implement it without using the *PartitionedFile* class?

I will test the PDF Data Source on Databricks after January 20, 2025.

Thanks a lot!

> Unable to run custom PDF Data Source on Databricks
> --------------------------------------------------
>
>                 Key: SPARK-50476
>                 URL: https://issues.apache.org/jira/browse/SPARK-50476
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0
>         Environment: Databricks Runtime: 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12)
>            Reporter: Mykola Melnyk
>            Priority: Minor
>         Attachments: PdfDataSourceDatabricks.ipynb, traceback.txt
>
>
> I experienced an error when running a custom PDF DataSource on Databricks Runtime 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12).
> The PDF DataSource works fine on the community version of Spark 3.5.0, but fails on Databricks:
> Log:
> Py4JJavaError: An error occurred while calling o428.showString.
> : java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.PartitionedFile.<init>(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/paths/SparkPath;JJ[Ljava/lang/String;JJLscala/collection/immutable/Map;)V
>     at com.stabrise.sparkpdf.datasources.PdfPartitionedFileUtil$.$anonfun$splitFiles$1(PdfPartitionedFileUtil.scala:32)
>     at com.stabrise.sparkpdf.datasources.PdfPartitionedFileUtil$.$anonfun$splitFiles$1$adapted(PdfPartitionedFileUtil.scala:28)
>     at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>
> [Source Code of the PDF DataSource|https://github.com/StabRise/spark-pdf/tree/spark_3_5]
> [Source code of PdfPartitionedFileUtil.scala|https://github.com/StabRise/spark-pdf/blob/spark_3_5/src/main/scala/datasources/PdfPartitionedFileUtil.scala#L32]
> PDF DataSource jar file: [https://github.com/StabRise/spark-pdf/releases/download/0.1.12_spark_3_5/spark-pdf-0.1.12.jar]
> Notebook with full example and traceback: [https://github.com/StabRise/spark-pdf/blob/spark_3_5/examples/PdfDataSourceDatabricks.ipynb]

--
This message was sent by Atlassian Jira
(v8.20.10#820010)