TD,

For a hybrid Python-Scala approach, what's the recommended way of handing off a DataFrame from Python to Scala? I would especially like to know in a streaming context.

I am not using notebooks/Databricks. We are running it on our own Spark 2.1 cluster.

Priyank

On Wed, Jul 26, 2017 at 12:49 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:

> We see that all the time. For example, in SQL, people can write their user-defined function in Scala/Java and use it from SQL/Python/anywhere. That is the recommended way to get the best combination of performance and ease-of-use from non-JVM languages.
>
> On Wed, Jul 26, 2017 at 11:49 AM, Priyank Shrivastava <priy...@asperasoft.com> wrote:
>
>> Thanks TD. I am going to try the Python-Scala hybrid approach by using Scala only for the custom Redis sink and Python for the rest of the app. I understand it might not be as efficient as writing the app purely in Scala, but unfortunately I am constrained on Scala resources. Have you come across other use cases where people have resorted to such a Python-Scala hybrid approach?
>>
>> Regards,
>> Priyank
>>
>> On Wed, Jul 26, 2017 at 1:46 AM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>>
>>> Hello Priyank,
>>>
>>> Writing something purely in Scala/Java would be the most efficient. Even if we exposed Python APIs that allowed writing custom sinks in pure Python, it won't be as efficient as a Scala/Java foreach, because the data would have to cross the JVM/PVM boundary, which has significant overheads. So a Scala/Java foreach is always going to be the best option.
>>>
>>> TD
>>>
>>> On Tue, Jul 25, 2017 at 6:05 PM, Priyank Shrivastava <priy...@asperasoft.com> wrote:
>>>
>>>> I am trying to write key-values to Redis using a DataStreamWriter object with the PySpark Structured Streaming APIs. I am using Spark 2.2.
>>>>
>>>> Since the foreach sink is not supported for Python (see <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach>), I am trying to find some alternatives.
>>>>
>>>> One alternative is to write a separate Scala module only to push data into Redis using foreach; ForeachWriter <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.ForeachWriter> is supported in Scala. But this doesn't seem like an efficient approach, and it adds deployment overhead because now I will have to support Scala in my app.
>>>>
>>>> Another approach is obviously to use Scala instead of Python, which is fine, but I want to make sure that I absolutely cannot use Python for this problem before I take that path.
>>>>
>>>> Would appreciate some feedback and alternative design approaches for this problem.
>>>>
>>>> Thanks.
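
[Editor's sketch] The Scala-only Redis sink discussed in this thread would be built around the `ForeachWriter` API that TD references. The following is a minimal sketch, not code from the thread: it assumes the Jedis Redis client is on the classpath, and the class name, constructor parameters, and the "key"/"value" column names are all illustrative.

```scala
import org.apache.spark.sql.{ForeachWriter, Row}
import redis.clients.jedis.Jedis  // assumed Redis client dependency

// A minimal ForeachWriter (Spark 2.x API) that writes key-value rows to Redis.
class RedisForeachWriter(host: String, port: Int) extends ForeachWriter[Row] {

  // Marked @transient: ForeachWriter instances are serialized to executors,
  // so the connection must be created in open(), not on the driver.
  @transient private var jedis: Jedis = _

  override def open(partitionId: Long, version: Long): Boolean = {
    jedis = new Jedis(host, port)
    true // returning true means process() will be called for this partition
  }

  override def process(row: Row): Unit = {
    // Assumes the streaming DataFrame has string columns named "key" and "value".
    jedis.set(row.getAs[String]("key"), row.getAs[String]("value"))
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (jedis != null) jedis.close()
  }
}

// Usage from the Scala module, given a streaming DataFrame `df`:
//   df.writeStream
//     .foreach(new RedisForeachWriter("localhost", 6379))
//     .start()
```

For the Python-to-Scala handoff itself, one common pattern (hedged, since the thread does not settle on one) is to package a class like the above in a jar, submit it with `--jars`, and have both sides share state through the same SparkSession rather than passing DataFrame objects across the language boundary directly.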