Re: convert java dataframe to pyspark dataframe

Khalid Mammadov Tue, 30 Mar 2021 15:08:07 -0700

Hi Aditya,

I think your original question was how to convert a DataFrame from a Spark session created in Java/Scala into a DataFrame in a Spark session created from Python (PySpark).


So, as I have answered on your SO question:


There is a missing call to *entry_point* before calling getDf() in your code

So, try this:

app = gateway.entry_point
j_df = app.getDf()

Additionally, I have created a working copy using Python and Scala (hope you don't mind) below. It shows how, on the Scala side, a py4j gateway is started with a Spark session and a sample DataFrame, and how, on the Python side, I have accessed that DataFrame and converted it to a Python List[Tuple] before converting it back to a DataFrame for a Spark session on the Python side:


*Python:*

from py4j.java_gateway import JavaGateway
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, IntegerType, StructField

if __name__ == '__main__':
    # Connect to the py4j gateway started on the Scala side
    gateway = JavaGateway()
    spark_app = gateway.entry_point
    df = spark_app.df()

    # Note: "apply" here is the Scala Row.apply(i) method,
    # which returns the field at position i of each row
    df_to_list_tuple = [(int(i.apply(0)), int(i.apply(1))) for i in df]

    spark = (SparkSession
             .builder
             .appName("My PySpark App")
             .getOrCreate())

    schema = StructType([
        StructField("a", IntegerType(), True),
        StructField("b", IntegerType(), True)])

    df = spark.createDataFrame(df_to_list_tuple, schema)
    df.show()


*Scala:*

import java.nio.file.{Path, Paths}

import org.apache.spark.sql.SparkSession
import py4j.GatewayServer

object SparkApp {
  val myFile: Path = Paths.get(System.getProperty("user.home") + "/dev/sample_data/games.csv")
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("My app")
    .getOrCreate()

  val df = spark.read
    .option("header", "True")
    .csv(myFile.toString)
    .collect()
}

object Py4JServerApp extends App {
  val server = new GatewayServer(SparkApp)
  server.start()
  print("Started and running...")
}
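
If it helps, the row-to-tuple conversion from the Python side can be tried on its own, without a running gateway. This is only a sketch under some assumptions: FakeRow and rows_to_int_tuples are hypothetical names I made up for illustration (they are not part of py4j or PySpark), and FakeRow just imitates the apply(i) accessor that the real row proxies expose. I am also assuming the CSV values arrive as strings (no schema is given when reading) and that a missing value shows up as None on the Python side.

```python
from typing import Any, Iterable, List, Optional, Tuple


class FakeRow:
    """Hypothetical stand-in for a JVM Row proxy; exposes apply(i) like Row does."""

    def __init__(self, *values: Any) -> None:
        self._values = values

    def apply(self, i: int) -> Any:
        # Return the field at position i, mirroring Row.apply(i)
        return self._values[i]


def rows_to_int_tuples(rows: Iterable[Any], n_cols: int = 2) -> List[Tuple[Optional[int], ...]]:
    """Copy the first n_cols fields of each row into a tuple of ints.

    Values are passed through int() because they are assumed to be strings.
    Missing values (None) are kept as None instead of raising an error.
    """
    result = []
    for row in rows:
        values = [row.apply(i) for i in range(n_cols)]
        result.append(tuple(None if v is None else int(v) for v in values))
    return result


# Stand-in data; with a live gateway this would be spark_app.df() instead.
sample = [FakeRow("1", "2"), FakeRow("3", None)]
converted = rows_to_int_tuples(sample)
print(converted)
```

The resulting list of tuples can be passed to spark.createDataFrame together with the schema, as in the Python listing. Keeping the None values works there because both fields are declared nullable.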



Regards,
Khalid


On 30/03/2021 07:57, Aditya Singh wrote:

Hi Sean,
Thanks a lot for replying and apologies for the late reply (I somehow missed this mail before), but I am under the impression that passing the py4j.java_gateway.JavaGateway object lets pyspark access the spark context created on the java side. My use case is exactly what you mentioned in the last email. I want to access the same spark session across java and pyspark. So how can we share the spark context, and in turn the spark session, across java and pyspark?
Regards,
Aditya
On Fri, 26 Mar 2021 at 6:49 PM, Sean Owen <sro...@gmail.com> wrote:
    The problem is that both of these are not sharing a SparkContext
    as far as I can see, so there is no way to share the object across
    them, let alone languages.

    You can of course write the data from Java, read it from Python.

    In some hosted Spark products, you can access the same session
    from two languages and register the DataFrame as a temp view in
    Java, then access it in Pyspark.


    On Fri, Mar 26, 2021 at 8:14 AM Aditya Singh
    <aditya.singh9...@gmail.com>
    wrote:

        Hi All,

        I am a newbie to spark and trying to pass a java dataframe to
        pyspark. The following link has details about what I am trying to do:

        
https://stackoverflow.com/questions/66797382/creating-pysparks-spark-context-py4j-java-gateway-object
        

        Can someone please help me with this?

        Thanks,
