Hi all,

I need some help with converting Arrow RecordBatches/Pandas DataFrames to (Arrow-enabled) Spark DataFrames.

The example in [1] explains very well how to convert a single Pandas DataFrame to a Spark DataFrame.
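
For reference, this is what I am doing today for a single DataFrame, following [1] (assuming a plain local SparkSession; the config key is the Spark 2.x name, in 3.x it is spark.sql.execution.arrow.pyspark.enabled):

    import numpy as np
    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Enable Arrow-based columnar transfers, as described in [1]
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    pdf = pd.DataFrame(np.random.rand(100, 3), columns=["a", "b", "c"])
    sdf = spark.createDataFrame(pdf)  # pandas -> Spark via Arrow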


In my case, however, external applications are generating Arrow RecordBatches inside my PySpark application in a streaming fashion. Each time I receive an Arrow RecordBatch, I want to transfer/append it to a Spark DataFrame. So is it possible to create a Spark DataFrame initially from one Arrow RecordBatch and then keep appending the incoming Arrow RecordBatches to that Spark DataFrame (streaming style)? Thanks!
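
To make the question concrete, here is a rough, untested sketch of what I have in mind (receive_record_batches() is just a placeholder for my external source, and I go through pandas only because I don't know of a direct RecordBatch path):

    # spark is the SparkSession from the snippet above

    def rb_to_spark_df(rb):
        # One small Spark DataFrame per incoming pyarrow.RecordBatch,
        # converted via pandas for lack of a direct API
        return spark.createDataFrame(rb.to_pandas())

    sdf = None
    for rb in receive_record_batches():  # placeholder for the external stream
        part = rb_to_spark_df(rb)
        # Spark DataFrames are immutable, so "appending" becomes a union
        sdf = part if sdf is None else sdf.union(part)

Is something like this per-batch union the right approach, or is there a better way?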


I saw another example [2] in which a whole list of Arrow RecordBatches is converted to a Spark DataFrame at once, but my case is a little bit different.
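
As I understand it, [2] collects all the RecordBatches up front and builds the DataFrame in one go; a simplified equivalent through pandas would be something like the following (batches being the complete list, which I never have in my streaming case):

    import pyarrow as pa

    # All batches known in advance: combine into one Table, convert once
    table = pa.Table.from_batches(batches)
    sdf = spark.createDataFrame(table.to_pandas())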


[1] https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html

[2] https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5

---
Regards,
Tanveer Ahmad

