Hi Priyank,

You can register the DataFrames as temporary views and use them across language boundaries:

Python:

    df = spark.readStream...
    # Python logic
    df.createOrReplaceTempView("tmp1")

Scala:

    val df = spark.table("tmp1")
    df.writeStream
      .foreach(...)
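To make the Scala side concrete, here is a minimal sketch of a ForeachWriter that pushes each row to Redis, reading from the view registered in Python. The Jedis client, the host/port, and the (key, value) string-column layout are assumptions for illustration, not something prescribed above:

    import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
    import redis.clients.jedis.Jedis

    // Sketch of a Redis sink, assuming the Jedis client and a streaming
    // view "tmp1" with two string columns: (key, value).
    class RedisWriter(host: String, port: Int) extends ForeachWriter[Row] {
      // The connection is created per partition in open(), on the
      // executor, so the writer itself stays serializable.
      @transient private var jedis: Jedis = _

      override def open(partitionId: Long, version: Long): Boolean = {
        jedis = new Jedis(host, port)
        true // true = process this partition
      }

      override def process(row: Row): Unit =
        jedis.set(row.getString(0), row.getString(1))

      override def close(errorOrNull: Throwable): Unit =
        if (jedis != null) jedis.close()
    }

    val spark = SparkSession.builder().getOrCreate()
    val query = spark.table("tmp1") // the view registered from Python
      .writeStream
      .foreach(new RedisWriter("localhost", 6379)) // placeholder host/port
      .start()

Error handling, reconnects, and checkpointing are left out for brevity.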
On Fri, Jul 28, 2017 at 3:06 PM, Priyank Shrivastava <priy...@asperasoft.com> wrote:

> TD,
>
> For a hybrid Python-Scala approach, what's the recommended way of handing
> off a dataframe from Python to Scala? I would especially like to know in a
> streaming context.
>
> I am not using notebooks/Databricks. We are running it on our own Spark
> 2.1 cluster.
>
> Priyank
>
> On Wed, Jul 26, 2017 at 12:49 PM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> We see that all the time. For example, in SQL, people can write their
>> user-defined function in Scala/Java and use it from SQL/Python/anywhere.
>> That is the recommended way to get the best combination of performance and
>> ease of use from non-JVM languages.
>>
>> On Wed, Jul 26, 2017 at 11:49 AM, Priyank Shrivastava <
>> priy...@asperasoft.com> wrote:
>>
>>> Thanks TD. I am going to try the Python-Scala hybrid approach, using
>>> Scala only for the custom Redis sink and Python for the rest of the app.
>>> I understand it might not be as efficient as writing the app purely in
>>> Scala, but unfortunately I am constrained on Scala resources. Have you
>>> come across other use cases where people have resorted to such a
>>> Python-Scala hybrid approach?
>>>
>>> Regards,
>>> Priyank
>>>
>>> On Wed, Jul 26, 2017 at 1:46 AM, Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
>>>> Hello Priyank,
>>>>
>>>> Writing something purely in Scala/Java would be the most efficient.
>>>> Even if we expose Python APIs that allow writing custom sinks in pure
>>>> Python, it won't be as efficient as the Scala/Java foreach, as the data
>>>> would have to cross the JVM/PVM boundary, which has significant
>>>> overheads. So Scala/Java foreach is always going to be the best option.
>>>>
>>>> TD
>>>>
>>>> On Tue, Jul 25, 2017 at 6:05 PM, Priyank Shrivastava <
>>>> priy...@asperasoft.com> wrote:
>>>>
>>>>> I am trying to write key-value pairs to Redis from a DataStreamWriter
>>>>> using the PySpark structured streaming APIs. I am using Spark 2.2.
>>>>>
>>>>> Since the foreach sink is not supported for Python (see here
>>>>> <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach>),
>>>>> I am trying to find some alternatives.
>>>>>
>>>>> One alternative is to write a separate Scala module just to push data
>>>>> into Redis using foreach; ForeachWriter
>>>>> <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.ForeachWriter>
>>>>> is supported in Scala. BUT this doesn't seem like an efficient
>>>>> approach, and it adds deployment overhead because I will now have to
>>>>> support Scala in my app.
>>>>>
>>>>> Another approach is obviously to use Scala instead of Python, which is
>>>>> fine, but I want to make sure that I absolutely cannot use Python for
>>>>> this problem before I take that path.
>>>>>
>>>>> I would appreciate some feedback and alternative design approaches for
>>>>> this problem.
>>>>>
>>>>> Thanks.
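For the UDF route TD mentions above (write the function in Scala/Java, call it from SQL/Python), a minimal sketch; the function name "shout" and its logic are made up for illustration:

    import org.apache.spark.sql.SparkSession

    object RegisterUdfs {
      // Registering under a SQL name makes the Scala function callable
      // from spark.sql(...) in PySpark as well.
      def register(spark: SparkSession): Unit =
        spark.udf.register("shout", (s: String) => s.toUpperCase + "!")
    }

From the Python side it would then just be spark.sql("SELECT shout(name) FROM some_view") (view and column names are placeholders), with no per-row Python execution involved.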