Thanks, Burak. In a streaming context, would I need to do any state management for the temp views? For example, across sliding windows?
Priyank

On Fri, Jul 28, 2017 at 3:13 PM, Burak Yavuz <brk...@gmail.com> wrote:

> Hi Priyank,
>
> You may register them as temporary tables to use across language
> boundaries.
>
> Python:
> df = spark.readStream...
> # Python logic
> df.createOrReplaceTempView("tmp1")
>
> Scala:
> val df = spark.table("tmp1")
> df.writeStream
>   .foreach(...)
>
> On Fri, Jul 28, 2017 at 3:06 PM, Priyank Shrivastava <priy...@asperasoft.com> wrote:
>
>> TD,
>>
>> For a hybrid Python-Scala approach, what's the recommended way of
>> handing off a dataframe from Python to Scala? I would especially like
>> to know in a streaming context.
>>
>> I am not using notebooks/Databricks. We are running it on our own
>> Spark 2.1 cluster.
>>
>> Priyank
>>
>> On Wed, Jul 26, 2017 at 12:49 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>>
>>> We see that all the time. For example, in SQL, people can write their
>>> user-defined functions in Scala/Java and use them from SQL, Python,
>>> or anywhere else. That is the recommended way to get the best
>>> combination of performance and ease of use from non-JVM languages.
>>>
>>> On Wed, Jul 26, 2017 at 11:49 AM, Priyank Shrivastava <priy...@asperasoft.com> wrote:
>>>
>>>> Thanks TD. I am going to try the Python-Scala hybrid approach, using
>>>> Scala only for the custom Redis sink and Python for the rest of the
>>>> app. I understand it might not be as efficient as writing the app
>>>> purely in Scala, but unfortunately I am constrained on Scala
>>>> resources. Have you come across other use cases where people have
>>>> resorted to such a Python-Scala hybrid approach?
>>>>
>>>> Regards,
>>>> Priyank
>>>>
>>>> On Wed, Jul 26, 2017 at 1:46 AM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>>>>
>>>>> Hello Priyank,
>>>>>
>>>>> Writing something purely in Scala/Java would be the most efficient.
>>>>> Even if we expose Python APIs that allow writing custom sinks in
>>>>> pure Python, they won't be as efficient as Scala/Java foreach,
>>>>> because the data would have to cross the JVM/PVM boundary, which
>>>>> has significant overheads. So Scala/Java foreach is always going to
>>>>> be the best option.
>>>>>
>>>>> TD
>>>>>
>>>>> On Tue, Jul 25, 2017 at 6:05 PM, Priyank Shrivastava <priy...@asperasoft.com> wrote:
>>>>>
>>>>>> I am trying to write key-values to Redis through a DataStreamWriter
>>>>>> using the PySpark Structured Streaming APIs. I am using Spark 2.2.
>>>>>>
>>>>>> Since the foreach sink is not supported for Python (see
>>>>>> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach),
>>>>>> I am trying to find some alternatives.
>>>>>>
>>>>>> One alternative is to write a separate Scala module only to push
>>>>>> data into Redis using foreach; ForeachWriter
>>>>>> <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.ForeachWriter>
>>>>>> is supported in Scala. BUT this doesn't seem like an efficient
>>>>>> approach, and it adds deployment overhead because now I will have
>>>>>> to support Scala in my app.
>>>>>>
>>>>>> Another approach is obviously to use Scala instead of Python, which
>>>>>> is fine, but I want to make sure that I absolutely cannot use
>>>>>> Python for this problem before I take that path.
>>>>>>
>>>>>> Would appreciate some feedback and alternative design approaches
>>>>>> for this problem.
>>>>>>
>>>>>> Thanks.
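
For concreteness, here is a minimal sketch of the Scala half of the hand-off Burak describes. It is a sketch under stated assumptions, not a definitive implementation: it assumes the "tmp1" view registered on the Python side, the Jedis client on the classpath, string columns named "key" and "value", and that both languages share the same SparkSession (temp views are session-scoped). The names RedisForeachWriter and RedisSinkApp are illustrative, not part of any Spark API.

import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
import redis.clients.jedis.Jedis

// Illustrative writer: assumes each row carries string columns "key" and
// "value". The connection is opened in open() rather than the constructor,
// because the writer is serialized to the executors and a live socket
// cannot be shipped from the driver.
class RedisForeachWriter(host: String, port: Int) extends ForeachWriter[Row] {
  @transient private var jedis: Jedis = _

  override def open(partitionId: Long, version: Long): Boolean = {
    jedis = new Jedis(host, port)
    true // returning true means "process this partition"
  }

  override def process(row: Row): Unit = {
    jedis.set(row.getAs[String]("key"), row.getAs[String]("value"))
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (jedis != null) jedis.close()
  }
}

object RedisSinkApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("redis-sink").getOrCreate()

    // Pick up the view the Python side registered via
    // df.createOrReplaceTempView("tmp1"); only visible from the same session.
    val df = spark.table("tmp1")

    val query = df.writeStream
      .foreach(new RedisForeachWriter("localhost", 6379))
      .start()
    query.awaitTermination()
  }
}

The per-partition open/process/close lifecycle is also why foreach is the recommended place for non-serializable client state like a Redis connection.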