Alex Khakhlyuk created SPARK-53917:
--------------------------------------

             Summary: [CONNECT] Supporting large LocalRelations
                 Key: SPARK-53917
                 URL: https://issues.apache.org/jira/browse/SPARK-53917
             Project: Spark
          Issue Type: Improvement
          Components: Connect, PySpark
    Affects Versions: 4.1.0
            Reporter: Alex Khakhlyuk
             Fix For: 4.1.0


LocalRelation is a Catalyst logical operator used to represent a dataset of 
rows inline as part of the LogicalPlan. LocalRelations represent dataframes 
created directly from Python and Scala objects, e.g., Python and Scala lists, 
pandas dataframes, csv files loaded in memory, etc.

In Spark Connect, local relations are transferred over gRPC using LocalRelation 
(for relations under 64MB) and CachedLocalRelation (larger relations over 64MB) 
messages.

CachedLocalRelations currently have a hard size limit of 2GB, which means that 
spark users can’t execute queries with local client data, pandas dataframes, 
csv files of over 2GB.

I propose removing this limit by introducing a new proto message for 
transferring large LocalRelations in chunks and by adding batch processing of 
the data both on the client (python and scala) and on the server.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to