Re: [I] Join in Python is very slow [wayang]

via GitHub Wed, 18 Mar 2026 22:58:42 -0700


AjayBoddeda4 commented on issue #569:
URL: https://github.com/apache/wayang/issues/569#issuecomment-4088008598


   Hi, I am Ajay Boddeda, a GSoC 2026 applicant working on the DataFrames API 
proposal for Apache Wayang.
   This issue is directly relevant to my proposal. A key advantage of building 
the DataFrame API on Spark Dataset[Row] as the backend is exactly this: it 
avoids per-element JVM-to-Python round trips entirely.
   When users write df.join() in the DataFrame API I am proposing, the join 
would be executed natively on Spark Dataset[Row] by Spark's optimized execution 
engine, with no Python UDFs involved and no per-element serialization overhead.
   The DataFrame API would therefore address this performance issue for join 
operations by keeping planning and execution inside Spark's query planner and 
execution engine rather than repeatedly crossing the JVM-Python boundary.
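   To make the difference concrete, here is a rough toy model in plain Python. 
It is not Wayang or Spark code: Boundary, hash_join, join_with_python_keys and 
join_with_plan are hypothetical names made up for illustration, and pickle 
only stands in for whatever serialization the real JVM-Python bridge uses. It 
simply counts how often data crosses a simulated boundary when the key 
function lives in Python versus when only a description of the join is shipped.

```python
import pickle


class Boundary:
    """Toy stand-in for the JVM-Python boundary: counts crossings and bytes."""

    def __init__(self):
        self.crossings = 0
        self.bytes_moved = 0

    def send(self, obj):
        # Serialize and deserialize to mimic the cost of one crossing.
        payload = pickle.dumps(obj)
        self.crossings += 1
        self.bytes_moved += len(payload)
        return pickle.loads(payload)


def hash_join(left, right, left_key, right_key):
    """Plain inner hash join; each key function is called once per row."""
    index = {}
    for row in right:
        index.setdefault(right_key(row), []).append(row)
    return [(l, r) for l in left for r in index.get(left_key(l), [])]


def join_with_python_keys(left, right, boundary):
    """Assumed slow path: the engine calls back into Python for every key."""

    def key_via_python(row):
        shipped = boundary.send(row)      # engine -> Python (the row)
        return boundary.send(shipped[0])  # Python -> engine (the key)

    return hash_join(left, right, key_via_python, key_via_python)


def join_with_plan(left, right, boundary):
    """Assumed fast path: only a small join description crosses, once."""
    plan = boundary.send({"op": "join", "on": 0, "how": "inner"})
    col = plan["on"]
    return hash_join(left, right, lambda r: r[col], lambda r: r[col])


left = [(i, f"user{i}") for i in range(1000)]
right = [(i, i * 10) for i in range(0, 1000, 2)]

udf_boundary, plan_boundary = Boundary(), Boundary()
slow = join_with_python_keys(left, right, udf_boundary)
fast = join_with_plan(left, right, plan_boundary)

print("same result:", slow == fast)
print("per-element crossings:", udf_boundary.crossings)
print("plan-level crossings:", plan_boundary.crossings)
```

   Both paths return the same joined rows; what differs is that the number of 
crossings in the first path grows with the input size, while the second path 
crosses once regardless of how many rows are joined.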
   Would love to discuss how this fits into the broader DataFrame API design.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

