mengw15 commented on issue #3496:
URL: https://github.com/apache/texera/issues/3496#issuecomment-3289995264

   I had a discussion with @Yicong-Huang regarding this issue and would like 
to share our findings and next steps.
   
   ## Output Tuple Count
   
   First, we were not able to reproduce the issue with the incorrect number of 
output tuples from the R UDF. In our tests, the number of output tuples matched 
the number of input tuples as expected.
   
   ## Performance Bottleneck – Two Considerations
   
   As for the performance bottleneck, we suspect two possible factors:
   
   1. Transfer Overhead (Python → R)
   
   Yicong suspects that the performance drop comes from how data is 
transferred from Python to R. We may currently be sending the data as a 
DataFrame (to be verified), which can introduce significant overhead. Instead, 
we should consider transferring the data as Arrow record batches. Moreover, 
when the target UDF is R, we should avoid converting the Arrow batch into a 
DataFrame on the Python side and forward the Arrow data directly to R.
   
   2. Processing Time in R
   
   If the bottleneck lies in R’s processing speed, one potential improvement is 
to adopt an approach similar to the Python UDF: pass the source code to the R 
process and let it spin up a thread to run that code.
   
   ### Next Steps
   
   I plan to run a set of tests to better understand the performance gap, 
including comparing the execution time of R code in Texera versus native R and 
measuring the difference in data transfer speed between Python UDFs and R UDFs 
within Texera.
   
   Initially, I’ll focus on investigating data transfer speed to validate the 
first hypothesis.
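
   For the measurements, a small timing helper along these lines could be 
used. This is only a sketch using the Python standard library; `time_call`, 
`summarize`, and the two placeholder workloads are hypothetical names, and the 
placeholders would be replaced by calls that push the same batch through the 
Python UDF path and the R UDF path.

```python
import statistics
import time
from typing import Callable, List


def time_call(fn: Callable[[], object], repeats: int = 5) -> List[float]:
    """Run fn `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(label: str, durations: List[float]) -> str:
    """Format the median and minimum duration for one measured path."""
    median = statistics.median(durations)
    return f"{label}: median={median:.6f}s min={min(durations):.6f}s"


# Placeholder workloads (assumptions): swap in the real transfer calls.
def python_udf_path() -> None:
    sum(range(10_000))


def r_udf_path() -> None:
    sorted(range(10_000), reverse=True)


print(summarize("python udf", time_call(python_udf_path)))
print(summarize("r udf", time_call(r_udf_path)))
```

   Reporting the median over several runs rather than a single run should 
reduce the effect of warm-up and scheduling noise on the comparison.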


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
