[ https://issues.apache.org/jira/browse/SPARK-40011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577693#comment-17577693 ]
Hyukjin Kwon commented on SPARK-40011:
--------------------------------------

Pandas API on Spark needs that because it reuses several metadata objects from pandas, and that dependency is explicitly declared at https://github.com/apache/spark/blob/master/python/setup.py#L271-L275.

> Pandas API on Spark requires Pandas
> -----------------------------------
>
>                 Key: SPARK-40011
>                 URL: https://issues.apache.org/jira/browse/SPARK-40011
>             Project: Spark
>          Issue Type: Bug
>          Components: Pandas API on Spark
>    Affects Versions: 3.3.0
>            Reporter: Daniel Oakley
>            Priority: Major
>
> Pandas API on Spark includes code like:
>
>     import pandas as pd
>     from pandas.api.types import is_hashable, is_list_like  # type: ignore[attr-defined]
>
> This breaks if you don't have pandas installed on your Spark cluster.
> The Pandas API was supposed to be an API, not a pandas integration, so why does it require pandas to be installed?
> In many setups, Spark jobs may be run on various Spark clusters with no assurance that particular Python packages are installed at the root level.
> Can this dependency be removed? Or can the required version of pandas be bundled with the Spark distribution? The same applies to numpy and other dependencies.
> If not, the docs should clearly state that it is not merely a Spark API that mirrors the pandas API, but something quite different.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
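For reference, the setup.py lines linked in the comment declare pandas (along with pyarrow and numpy) as the optional `pandas_on_spark` extra of the pyspark package, so a cluster image can pull them in with `pip install "pyspark[pandas_on_spark]"`. The snippet below is only a minimal sketch, not part of Spark itself: a hypothetical pre-flight check a job could run before importing `pyspark.pandas` when it is unsure whether those optional packages are present. The helper name and the exact module list are assumptions for illustration.

    # Hypothetical pre-flight check (not Spark API): verify that the optional
    # dependencies pandas-on-Spark relies on are importable before using it.
    from importlib import util


    def pandas_on_spark_available() -> bool:
        """Return True if the optional deps used by pyspark.pandas are importable."""
        # Assumed module list, based on the pandas_on_spark extra in setup.py.
        return all(util.find_spec(mod) is not None for mod in ("pandas", "pyarrow", "numpy"))


    if pandas_on_spark_available():
        import pyspark.pandas as ps  # safe only when pandas/pyarrow/numpy exist

        psdf = ps.DataFrame({"x": [1, 2, 3]})
    else:
        raise ImportError(
            "pandas API on Spark needs pandas, pyarrow and numpy; "
            "e.g. pip install 'pyspark[pandas_on_spark]'"
        )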