mengw15 commented on issue #3496: URL: https://github.com/apache/texera/issues/3496#issuecomment-3289995264
I had a discussion with @Yicong-Huang regarding this issue and would like to share our findings and next steps.

## Output Tuple Count

First, we were not able to reproduce the incorrect number of output tuples from the R UDF. In our tests, the number of output tuples matched the number of input tuples, as expected.

## Performance Bottleneck – Two Considerations

As for the performance bottleneck, we suspect two possible factors:

1. Transfer Overhead (Python → R)
   Yicong suspects that the performance drop could be due to how data is transferred from Python to R. Currently, we may be sending the data as a DataFrame (I will check whether this is true), which can introduce significant overhead. Instead, we should consider transferring the data as Arrow record batches. Moreover, when the target UDF is R, we should avoid converting the Arrow batch into a DataFrame on the Python side and instead forward the Arrow data directly to R.
2. Processing Time in R
   If the bottleneck lies in R’s processing speed, one potential improvement is to adopt an approach similar to the Python UDF, where we pass the source code and let the R process spin up a thread to run it.

### Next Steps

I plan to conduct a set of tests to better understand the performance gap. These include comparing the execution time of R code in Texera versus native R, and measuring the difference in data transfer speed between Python UDFs and R UDFs within Texera. Initially, I’ll focus on data transfer speed to validate the first hypothesis.
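To make the two hypotheses easier to tell apart, the measurement could time the transfer step and the processing step separately. The following is only a minimal sketch of such a harness, not Texera code: `StageTimings` and `time_stages` are hypothetical names, and `pickle` is used purely as a stand-in for whatever serialization the Python → R path actually performs. It also checks that the output tuple count matches the input tuple count.

```python
import math
import pickle
import time
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class StageTimings:
    """Best-of-N wall-clock time, in seconds, for each stage (hypothetical helper)."""
    serialize_s: float
    deserialize_s: float
    process_s: float

    @property
    def transfer_s(self) -> float:
        # Transfer cost is approximated as serialize + deserialize time.
        return self.serialize_s + self.deserialize_s

    @property
    def transfer_fraction(self) -> float:
        # Share of the total time spent on transfer rather than processing.
        total = self.transfer_s + self.process_s
        return self.transfer_s / total if total > 0 else 0.0


def time_stages(
    rows: Sequence[Any],
    udf: Callable[[List[Any]], List[Any]],
    repeats: int = 3,
) -> StageTimings:
    """Time serialization, deserialization, and UDF execution separately.

    Assumption: pickle stands in for the real transfer format; swap in the
    actual serialization calls to measure a real pipeline.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    best = [math.inf, math.inf, math.inf]
    for _ in range(repeats):
        t0 = time.perf_counter()
        payload = pickle.dumps(list(rows))   # sender side
        t1 = time.perf_counter()
        received = pickle.loads(payload)     # receiver side
        t2 = time.perf_counter()
        output = udf(received)               # the UDF body itself
        t3 = time.perf_counter()
        # Sanity check for the output tuple count.
        if len(output) != len(rows):
            raise ValueError(
                f"expected {len(rows)} output tuples, got {len(output)}"
            )
        for i, elapsed in enumerate((t1 - t0, t2 - t1, t3 - t2)):
            best[i] = min(best[i], elapsed)
    return StageTimings(*best)


if __name__ == "__main__":
    sample = [(i, float(i)) for i in range(100_000)]
    timings = time_stages(sample, lambda batch: [(k, v * 2) for k, v in batch])
    print(f"transfer: {timings.transfer_s:.4f}s, process: {timings.process_s:.4f}s")
    print(f"transfer share of total: {timings.transfer_fraction:.1%}")
```

If the transfer share dominates, that would point toward the first hypothesis; if processing dominates, toward the second. Taking the best of several repeats is one simple way to reduce noise from warm-up and scheduling.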
