[jira] [Commented] (SPARK-50476) Unable to run custom PDF Data Source on Databricks

Mykola Melnyk (Jira) Fri, 06 Dec 2024 22:15:54 -0800


    [ 
https://issues.apache.org/jira/browse/SPARK-50476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17903798#comment-17903798
 ]


Mykola Melnyk commented on SPARK-50476:
---------------------------------------

Hello [~Gengliang.Wang]

Thank you for the response.
Yes, I understand that the _PartitionedFile_ class is an internal Spark class, 
but I did not find another way to handle manual partitioning at the data 
source level.
I found this to be one possible way to implement lazy reading of pages from a 
PDF document.
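
To make the idea of page-level partitioning concrete, here is a minimal sketch in plain Scala that uses no Spark classes at all. The names _PagePartition_ and _splitPages_ are hypothetical and are not taken from Spark or from spark-pdf; the sketch only shows how a document could be cut into page ranges that are read lazily, one range per task. How such ranges would be wired into a data source (for example through the public Data Source V2 _InputPartition_ interface) is left open here.

```scala
// Hypothetical illustration only: PagePartition and splitPages are not part
// of Spark or spark-pdf.

// A contiguous range of pages in one PDF file; endPage is exclusive.
final case class PagePartition(path: String, startPage: Int, endPage: Int)

object PagePartitioning {
  /** Split a document with `pageCount` pages into ranges of at most
    * `pagesPerPartition` pages each. No page content is read here; a reader
    * would open the file and load only its own range later.
    */
  def splitPages(path: String, pageCount: Int, pagesPerPartition: Int): Seq[PagePartition] = {
    require(pageCount >= 0, "pageCount must be non-negative")
    require(pagesPerPartition > 0, "pagesPerPartition must be positive")
    (0 until pageCount by pagesPerPartition).map { start =>
      PagePartition(path, start, math.min(start + pagesPerPartition, pageCount))
    }
  }
}
```

For example, `PagePartitioning.splitPages("doc.pdf", 10, 4)` yields three ranges covering pages 0 to 4, 4 to 8, and 8 to 10 (end exclusive).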

Or do you have other ideas for how to implement it without using the 
*PartitionedFile* class?

I will test the PDF Data Source on Databricks after January 20, 2025.

Thanks a lot!

> Unable to run custom PDF Data Source on Databricks
> --------------------------------------------------
>
>                 Key: SPARK-50476
>                 URL: https://issues.apache.org/jira/browse/SPARK-50476
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0
>         Environment: Databricks Runtime: 14.3 LTS (includes Apache Spark 
> 3.5.0, Scala 2.12)
>            Reporter: Mykola Melnyk
>            Priority: Minor
>         Attachments: PdfDataSourceDatabricks.ipynb, traceback.txt
>
>
> An error occurs when running the custom PDF DataSource on Databricks Runtime 
> 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12).
> The PDF DataSource works fine on the community version of Spark 3.5.0, but 
> fails on Databricks:
> Log:
> Py4JJavaError: An error occurred while calling o428.showString.
> : java.lang.NoSuchMethodError: 
> org.apache.spark.sql.execution.datasources.PartitionedFile.<init>(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/paths/SparkPath;JJ[Ljava/lang/String;JJLscala/collection/immutable/Map;)V
> at 
> com.stabrise.sparkpdf.datasources.PdfPartitionedFileUtil$.$anonfun$splitFiles$1(PdfPartitionedFileUtil.scala:32)
> at 
> com.stabrise.sparkpdf.datasources.PdfPartitionedFileUtil$.$anonfun$splitFiles$1$adapted(PdfPartitionedFileUtil.scala:28)
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
> [Source Code of the PDF 
> DataSource|https://github.com/StabRise/spark-pdf/tree/spark_3_5]
> [Source code of PdfPartitionedFileUtil.scala 
> |https://github.com/StabRise/spark-pdf/blob/spark_3_5/src/main/scala/datasources/PdfPartitionedFileUtil.scala#L32]
> PDF DataSource jar file:  
> [https://github.com/StabRise/spark-pdf/releases/download/0.1.12_spark_3_5/spark-pdf-0.1.12.jar]
> Notebook with full example and traceback: 
> [https://github.com/StabRise/spark-pdf/blob/spark_3_5/examples/PdfDataSourceDatabricks.ipynb|https://github.com/StabRise/spark-pdf/blob/spark_3_5/examples/PdfDataSourceDatabricks.ipynb]


