AjayBoddeda4 commented on issue #569: URL: https://github.com/apache/wayang/issues/569#issuecomment-4088008598
Hi, I am Ajay Boddeda, a GSoC 2026 applicant working on the DataFrames API proposal for Apache Wayang. This issue is very relevant to my proposal.

One of the key advantages of building a proper DataFrame API with Spark's Dataset[Row] as the backend is exactly this: it avoids per-element JVM-to-Python round trips entirely. When users write df.join() in the DataFrame API I am proposing, the join would run natively on Dataset[Row] inside Spark's execution engine, with no Python UDFs involved and no per-element serialization overhead.

The DataFrame API would therefore address this performance issue for join operations: Spark's query optimizer plans the join and Spark's engine executes it, so rows never cross the JVM-Python boundary one at a time.

I would love to discuss how this fits into the broader DataFrame API design.
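
A rough illustration of the round-trip argument in the comment above, as a toy model in plain Python. This is not Wayang or Spark code: `Boundary`, `udf_join`, and `declarative_join` are hypothetical names invented for this sketch, and `pickle` only stands in for whatever serialization a real JVM-Python bridge uses. The point is only to count how often data crosses the boundary when the join key is an opaque Python function versus when the join is described by a column name.

```python
import pickle


class Boundary:
    """Toy stand-in for a JVM-Python boundary; counts crossings and bytes."""

    def __init__(self):
        self.crossings = 0
        self.bytes_moved = 0

    def send(self, obj):
        # Serialize and deserialize to mimic shipping a value across processes.
        payload = pickle.dumps(obj)
        self.crossings += 1
        self.bytes_moved += len(payload)
        return pickle.loads(payload)


def udf_join(left, right, key_fn, boundary):
    """Hash join whose key extractor is an opaque Python function.

    Every row must be shipped across the boundary so key_fn can see it.
    """
    index = {}
    for row in right:
        index.setdefault(key_fn(boundary.send(row)), []).append(row)
    out = []
    for row in left:
        for match in index.get(key_fn(boundary.send(row)), []):
            out.append((row, match))
    return out


def declarative_join(left, right, key_column, boundary):
    """Hash join described by a column name.

    Only the small plan description crosses the boundary, once;
    the rows themselves stay on the engine side.
    """
    plan = boundary.send({"op": "join", "on": key_column})
    col = plan["on"]
    index = {}
    for row in right:
        index.setdefault(row[col], []).append(row)
    return [(l, r) for l in left for r in index.get(l[col], [])]


# Hypothetical sample data: 1000 rows on the left, 500 on the right.
left_rows = [{"id": i, "name": f"user{i}"} for i in range(1000)]
right_rows = [{"id": i, "score": i * 2} for i in range(0, 1000, 2)]

b_udf, b_decl = Boundary(), Boundary()
joined_udf = udf_join(left_rows, right_rows, lambda r: r["id"], b_udf)
joined_decl = declarative_join(left_rows, right_rows, "id", b_decl)

print(b_udf.crossings, b_decl.crossings)  # 1500 1
```

Both functions produce the same join result; the UDF variant crosses the boundary once per input row, while the declarative variant crosses it once in total. A real engine adds batching and other optimizations, so the actual gap depends on the workload.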
