Alex Khakhlyuk created SPARK-53917:
--------------------------------------
Summary: [CONNECT] Supporting large LocalRelations
Key: SPARK-53917
URL: https://issues.apache.org/jira/browse/SPARK-53917
Project: Spark
Issue Type: Improvement
Components: Connect, PySpark
Affects Versions: 4.1.0
Reporter: Alex Khakhlyuk
Fix For: 4.1.0
LocalRelation is a Catalyst logical operator used to represent a dataset of
rows inline as part of the LogicalPlan. LocalRelations represent dataframes
created directly from Python and Scala objects, e.g., Python and Scala lists,
pandas dataframes, csv files loaded in memory, etc.
In Spark Connect, local relations are transferred over gRPC using LocalRelation
(for relations under 64MB) and CachedLocalRelation (larger relations over 64MB)
messages.
CachedLocalRelations currently have a hard size limit of 2GB, which means that
spark users can’t execute queries with local client data, pandas dataframes,
csv files of over 2GB.
I propose removing this limit by introducing a new proto message for
transferring large LocalRelations in chunks and by adding batch processing of
the data both on the client (python and scala) and on the server.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]