[ 
https://issues.apache.org/jira/browse/SPARK-53917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Khakhlyuk updated SPARK-53917:
-----------------------------------
    Attachment: image-2025-10-15-13-56-46-840.png

> [CONNECT] Supporting large LocalRelations
> -----------------------------------------
>
>                 Key: SPARK-53917
>                 URL: https://issues.apache.org/jira/browse/SPARK-53917
>             Project: Spark
>          Issue Type: Improvement
>          Components: Connect, PySpark
>    Affects Versions: 4.1.0
>            Reporter: Alex Khakhlyuk
>            Priority: Major
>             Fix For: 4.1.0
>
>         Attachments: image-2025-10-15-13-50-04-179.png, 
> image-2025-10-15-13-50-44-333.png, image-2025-10-15-13-53-40-306.png, 
> image-2025-10-15-13-56-46-840.png
>
>
> LocalRelation is a Catalyst logical operator that represents a dataset of 
> rows inline, as part of the LogicalPlan itself. LocalRelations back dataframes 
> created directly from Python and Scala objects, e.g., Python and Scala lists, 
> pandas dataframes, CSV files loaded into memory, etc.
> In Spark Connect, local relations are transferred over gRPC using 
> LocalRelation messages (for relations under 64MB) and CachedLocalRelation 
> messages (for relations of 64MB and above).
> CachedLocalRelations currently have a hard size limit of 2GB, which means 
> that Spark users cannot run queries over local client data (pandas 
> dataframes, in-memory CSV files, etc.) larger than 2GB.
> I propose removing this limit by introducing a new proto message that 
> transfers large LocalRelations in chunks, and by adding batch processing of 
> the data on both the client (Python and Scala) and the server.
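
The chunking idea above can be sketched roughly as follows. This is a minimal illustration only, not the actual Spark Connect implementation: the function name `chunk_payload` and the fixed chunk size are assumptions for the sketch, and the real proposal would wrap each chunk in a new proto message rather than yield raw tuples.

```python
# Hypothetical sketch: split a serialized local-relation payload (e.g. an
# Arrow IPC stream) into fixed-size chunks so that each gRPC message stays
# under the per-message size limit. 64MB mirrors the LocalRelation threshold
# mentioned in the issue; the actual chunk size is an implementation detail.
CHUNK_SIZE = 64 * 1024 * 1024

def chunk_payload(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield (chunk_index, total_chunks, chunk_bytes) for each chunk.

    Always yields at least one chunk, even for an empty payload, so the
    server can distinguish "empty relation" from "no data sent".
    """
    total = max(1, (len(data) + chunk_size - 1) // chunk_size)
    for i in range(total):
        yield i, total, data[i * chunk_size:(i + 1) * chunk_size]
```

The server side would reassemble the chunks in index order and process the resulting batches incrementally, which is what removes the 2GB single-message ceiling.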



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
