Hi All, I'm currently working on a project that transfers data between Spark 3.x (I use Scala) and a Python runtime. In Spark, the data lives in an RDD of floating-point arrays/vectors, and I have custom routines written in Python to process them. On the Spark side I also rely on operations specific to the Spark Scala APIs, so I need both runtimes.
To move the data back and forth I've been using the RDD.pipe() API:

1. Spark converts each array to a string and calls RDD.pipe(script.py).
2. The Python script reads the strings from stdin and parses them into Python data structures, then runs the custom routines.
3. Python serializes the result arrays back to strings and prints them to stdout.
4. Spark reads the printed strings and parses them back into arrays.

Needless to say, this feels unnatural and slow to me, and there are potential floating-point precision issues; I think the float arrays should really be transmitted as raw bytes. I found no way to use RDD.pipe() for that purpose: judging from https://github.com/apache/spark/blob/3331d4ccb7df9aeb1972ed86472269a9dbd261ff/core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala#L139, .pipe() seems to be locked to line-oriented text streaming.

Can anyone shed some light on how I can achieve this? I'm trying to come up with a way that does not involve modifying core Spark myself. One potential solution I can think of is saving/loading the RDD as binary files, but I'm hoping to find a streaming-based solution. For concreteness, two sketches of what I mean follow below.
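Here is roughly what the Scala side looks like today (a minimal sketch of the current text-based flow; the script name "./script.py" and the comma delimiter are placeholders, and the toy data stands in for the real RDD):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    object TextPipeSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("text-pipe").setMaster("local[*]"))

        // Toy stand-in for the real data; the actual RDD comes from upstream stages.
        val vectors: RDD[Array[Float]] = sc.parallelize(Seq(
          Array(1.0f, 2.5f, 3.25f),
          Array(0.1f, 0.2f, 0.3f)
        ))

        val roundTripped: RDD[Array[Float]] = vectors
          .map(_.mkString(","))              // step 1: array -> "0.1,0.2,0.3"
          .pipe("./script.py")               // steps 2-3: Python parses, computes, prints
          .map(_.split(',').map(_.toFloat))  // step 4: printed line -> Array[Float]

        roundTripped.collect().foreach(a => println(a.mkString(" ")))
        sc.stop()
      }
    }

Every float goes through a decimal toString/parse cycle twice (once per direction), which is where the formatting overhead and my precision worry come from.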
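And this is the kind of binary-file fallback I mean: a sketch (not tested, and the helper names are mine) that packs each vector into raw little-endian bytes and stores one BytesWritable record per vector in a SequenceFile, so no decimal formatting is involved anywhere. If I'm not mistaken, PySpark can read the same files with sc.sequenceFile() and get the values back as raw bytes (e.g. for numpy.frombuffer with dtype float32).

    import java.nio.{ByteBuffer, ByteOrder}
    import org.apache.hadoop.io.{BytesWritable, NullWritable}
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    object BinaryFileSketch {
      // Pack a float vector into raw little-endian bytes (4 bytes per float),
      // preserving the exact bit patterns end to end.
      def floatsToBytes(a: Array[Float]): Array[Byte] = {
        val buf = ByteBuffer.allocate(a.length * 4).order(ByteOrder.LITTLE_ENDIAN)
        a.foreach(buf.putFloat)
        buf.array()
      }

      def bytesToFloats(b: Array[Byte]): Array[Float] = {
        val buf = ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN)
        Array.fill(b.length / 4)(buf.getFloat)
      }

      // Write one BytesWritable record per vector; variable lengths are fine
      // because SequenceFile records carry their own length.
      def writeVectors(vectors: RDD[Array[Float]], path: String): Unit =
        vectors
          .map(a => (NullWritable.get(), new BytesWritable(floatsToBytes(a))))
          .saveAsSequenceFile(path)

      // Read the records back on the Scala side.
      def readVectors(sc: SparkContext, path: String): RDD[Array[Float]] =
        sc.sequenceFile(path, classOf[NullWritable], classOf[BytesWritable])
          .map { case (_, bw) => bytesToFloats(bw.copyBytes()) }
    }

This would obviously be a save-then-load round trip rather than streaming, which is why I'd still prefer a pipe-like solution if one exists. Any help is much appreciated, thanks!

Best regards,
Yuhao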