Thanks a lot, this was really helpful.

On Wed, 31 Mar 2021 at 4:13 PM, Khalid Mammadov <khalidmammad...@gmail.com> wrote:
> I think what you want to achieve is what PySpark is actually doing in its
> API under the hood.
>
> So, specifically, you need to look at PySpark's implementation of the
> DataFrame, SparkSession and SparkContext APIs. Under the hood, that is
> what is happening: PySpark starts a py4j gateway and delegates all Spark
> operations and object creation to the JVM.
>
> Look for example here
> <https://github.com/apache/spark/blob/48ef9bd2b3a0cb52aaa31a3fada8779b7a7b9132/python/pyspark/context.py#L209>,
> here
> <https://github.com/apache/spark/blob/48ef9bd2b3a0cb52aaa31a3fada8779b7a7b9132/python/pyspark/sql/session.py#L248>
> and here
> <https://github.com/apache/spark/blob/48ef9bd2b3a0cb52aaa31a3fada8779b7a7b9132/python/pyspark/java_gateway.py#L153>,
> where the Python SparkSession/SparkContext communicates with the JVM and
> creates the SparkSession/SparkContext on the JVM side. The rest of the
> PySpark code then delegates execution to them.
>
> If you create these objects yourself, backed by a custom Java/Scala
> application that holds the Spark objects, you can then experiment with
> creating DataFrames and passing them back and forth. But, I must admit,
> this is not part of the official documentation: you would be playing with
> Spark internals, which are subject to change (often).
>
> So, I am not sure what your actual requirement is, but you will need to
> implement your own custom version of the PySpark API to get all the
> functionality you need and control over the JVM side.
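>
> For illustration only, here is a rough, untested sketch of what that
> wiring could look like on the Python side. Everything below leans on
> PySpark's private constructors (SparkConf(_jconf=...),
> SparkContext(gateway=..., jsc=...), SparkSession(sc, jsparkSession) and
> DataFrame(jdf, sqlContext)), which can change between releases, and the
> entry-point accessors jsc(), sparkSession() and jdf() are hypothetical
> methods your JVM application would need to expose:
>
> from py4j.java_gateway import JavaGateway
> from pyspark import SparkConf, SparkContext
> from pyspark.sql import DataFrame, SparkSession
>
> # Connect to the GatewayServer started by the JVM application
> gateway = JavaGateway()
> app = gateway.entry_point
>
> # Wrap the JVM's existing contexts instead of creating new ones,
> # mimicking what pyspark's java_gateway.py/context.py do internally
> jsc = app.jsc()                               # hypothetical: JavaSparkContext
> conf = SparkConf(_jconf=jsc.getConf())        # private constructor
> sc = SparkContext(gateway=gateway, jsc=jsc, conf=conf)
>
> # Wrap the JVM SparkSession and DataFrame without copying any data;
> # spark._wrapped is the internal SQLContext handle
> spark = SparkSession(sc, app.sparkSession())  # hypothetical accessor
> py_df = DataFrame(app.jdf(), spark._wrapped)  # hypothetical accessor
> py_df.show()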
> On 31/03/2021 06:49, Aditya Singh wrote:
>
>> Thanks a lot Khalid for replying.
>>
>> I have one question though. The approach you showed requires the Python
>> side to know the data types of the DataFrame's columns beforehand. Can we
>> implement a generic approach where this information is not required and
>> we just have the Java DataFrame as input on the Python side?
>>
>> Also one more question: in my use case I will be sending a DataFrame from
>> Java to Python, and on the Python side some transformations will be done
>> on the DataFrame (including Python UDFs), but no actions will be
>> performed there; it will then be sent back to Java, where the actions
>> will be performed. So I also wanted to ask whether this is feasible and,
>> if yes, whether we need to send some special jars to the executors so
>> that they can execute the UDFs on the DataFrame.
>>
>> On Wed, 31 Mar 2021 at 3:37 AM, Khalid Mammadov <khalidmammad...@gmail.com>
>> wrote:
>>
>> Hi Aditya,
>>
>> I think your original question was how to convert a DataFrame from a
>> Spark session created on Java/Scala to a DataFrame on a Spark session
>> created from Python (PySpark).
>>
>> So, as I have answered on your SO question:
>>
>> There is a missing call to *entry_point* before calling getDf() in your
>> code.
>>
>> So, try this:
>>
>> app = gateway.entry_point
>> j_df = app.getDf()
>>
>> Additionally, I have created a working example using Python and Scala
>> (hope you don't mind) below. It shows how, on the Scala side, a py4j
>> gateway is started with a Spark session and a sample DataFrame, and how,
>> on the Python side, I access that DataFrame and convert it to a Python
>> List[Tuple] before converting it back to a DataFrame for a Spark session
>> on the Python side:
>>
>> *Python:*
>>
>> from py4j.java_gateway import JavaGateway
>> from pyspark.sql import SparkSession
>> from pyspark.sql.types import StructType, IntegerType, StructField
>>
>> if __name__ == '__main__':
>>     gateway = JavaGateway()
>>
>>     spark_app = gateway.entry_point
>>     df = spark_app.df()
>>
>>     # Note: the "apply" method here comes from Scala and is used to
>>     # access elements of each collected array
>>     df_to_list_tuple = [(int(i.apply(0)), int(i.apply(1))) for i in df]
>>
>>     spark = (SparkSession
>>              .builder
>>              .appName("My PySpark App")
>>              .getOrCreate())
>>
>>     schema = StructType([
>>         StructField("a", IntegerType(), True),
>>         StructField("b", IntegerType(), True)])
>>
>>     df = spark.createDataFrame(df_to_list_tuple, schema)
>>
>>     df.show()
>>
>> *Scala:*
>>
>> import java.nio.file.{Path, Paths}
>>
>> import org.apache.spark.sql.SparkSession
>> import py4j.GatewayServer
>>
>> object SparkApp {
>>   val myFile: Path = Paths.get(System.getProperty("user.home") +
>>     "/dev/sample_data/games.csv")
>>
>>   val spark = SparkSession.builder()
>>     .master("local[*]")
>>     .appName("My app")
>>     .getOrCreate()
>>
>>   val df = spark
>>     .read
>>     .option("header", "True")
>>     .csv(myFile.toString)
>>     .collect()
>> }
>>
>> object Py4JServerApp extends App {
>>   val server = new GatewayServer(SparkApp)
>>   server.start()
>>
>>   print("Started and running...")
>> }
>>
>> Regards,
>> Khalid
>>
>> On 30/03/2021 07:57, Aditya Singh wrote:
>>
>> Hi Sean,
>>
>> Thanks a lot for replying, and apologies for the late reply (I somehow
>> missed this mail before), but I am under the impression that passing the
>> py4j.java_gateway.JavaGateway object lets PySpark access the Spark
>> context created on the Java side.
>> My use case is exactly what you mentioned in the last email: I want to
>> access the same Spark session across Java and PySpark. So how can we
>> share the Spark context, and in turn the Spark session, across Java and
>> PySpark?
>>
>> Regards,
>> Aditya
>>
>> On Fri, 26 Mar 2021 at 6:49 PM, Sean Owen <sro...@gmail.com> wrote:
>>
>>> The problem is that these two are not sharing a SparkContext as far as
>>> I can see, so there is no way to share the object between them, let
>>> alone across languages.
>>>
>>> You can of course write the data from Java and read it from Python.
>>>
>>> In some hosted Spark products, you can access the same session from two
>>> languages: register the DataFrame as a temp view in Java, then access
>>> it in PySpark (see the sketch at the end of this message).
>>>
>>> On Fri, Mar 26, 2021 at 8:14 AM Aditya Singh <aditya.singh9...@gmail.com>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I am a newbie to Spark and am trying to pass a Java DataFrame to
>>>> PySpark. The following link has details about what I am trying to do:
>>>>
>>>> https://stackoverflow.com/questions/66797382/creating-pysparks-spark-context-py4j-java-gateway-object
>>>>
>>>> Can someone please help me with this?
>>>>
>>>> Thanks,
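>>>
>>> For what it's worth, a minimal sketch of that temp-view approach,
>>> assuming a product where both languages really do see the same session
>>> (the view name "shared_df" is just an example):
>>>
>>> *Scala:*
>>>
>>> // Register the DataFrame in the shared session's catalog
>>> df.createOrReplaceTempView("shared_df")
>>>
>>> *Python:*
>>>
>>> # Look up the same data through the shared catalog; no copying involved
>>> py_df = spark.sql("SELECT * FROM shared_df")
>>> py_df.show()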