Hi TD,

I thought Structured Streaming provides a concept similar to DataFrames, where it does not matter which language I use to invoke the APIs, with the exception of UDFs.
So when I think of supporting a foreach sink in Python, I think of it as just a wrapper API, with the data remaining in the JVM only, similar to, for example, a Hive writer or HDFS writer in the DataFrame API. Am I oversimplifying? Or is it just early days for Structured Streaming? Happy to learn of any mistakes in my thinking and understanding.

Best
Ayan

On Thu, 27 Jul 2017 at 4:49 am, Priyank Shrivastava <priy...@asperasoft.com> wrote:

> Thanks TD. I am going to try the Python-Scala hybrid approach by using
> Scala only for the custom Redis sink and Python for the rest of the app. I
> understand it might not be as efficient as writing the app purely in Scala,
> but unfortunately I am constrained on Scala resources. Have you come
> across other use cases where people have resorted to such a Python-Scala
> hybrid approach?
>
> Regards,
> Priyank
>
> On Wed, Jul 26, 2017 at 1:46 AM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>
>> Hello Priyank,
>>
>> Writing something purely in Scala/Java would be the most efficient. Even
>> if we expose Python APIs that allow writing custom sinks in pure Python, it
>> won't be as efficient as the Scala/Java foreach, as the data would have to
>> cross the JVM/PVM boundary, which has significant overheads. So Scala/Java
>> foreach is always going to be the best option.
>>
>> TD
>>
>> On Tue, Jul 25, 2017 at 6:05 PM, Priyank Shrivastava <priy...@asperasoft.com> wrote:
>>
>>> I am trying to write key-values to Redis using a DataStreamWriter object
>>> via the PySpark Structured Streaming APIs. I am using Spark 2.2.
>>>
>>> Since the foreach sink is not supported for Python (see here
>>> <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach>),
>>> I am trying to find some alternatives.
>>>
>>> One alternative is to write a separate Scala module only to push data
>>> into Redis using foreach; ForeachWriter
>>> <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.ForeachWriter>
>>> is supported in Scala. But this doesn't seem like an efficient approach,
>>> and it adds deployment overhead because now I will have to support Scala
>>> in my app.
>>>
>>> Another approach is obviously to use Scala instead of Python, which is
>>> fine, but I want to make sure that I absolutely cannot use Python for this
>>> problem before I take that path.
>>>
>>> Would appreciate some feedback and alternative design approaches for
>>> this problem.
>>>
>>> Thanks.

--
Best Regards,
Ayan Guha
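
[Editor's sketch] Since the thread hinges on Scala's ForeachWriter and its open/process/close lifecycle, here is a minimal sketch of what the Scala side of such a Redis sink could look like. To keep the sketch self-contained, `ForeachWriterLike` is a locally defined stand-in with the same lifecycle as `org.apache.spark.sql.ForeachWriter`, and `RedisKeyValueWriter` with its in-memory buffer is hypothetical; a real implementation would extend Spark's actual trait and open a Redis client connection (e.g. Jedis) in `open()`.

```scala
// Stand-in for org.apache.spark.sql.ForeachWriter[T] so this sketch
// compiles without a Spark dependency; the real trait has the same
// open/process/close lifecycle, invoked once per partition per epoch.
abstract class ForeachWriterLike[T] {
  def open(partitionId: Long, version: Long): Boolean
  def process(value: T): Unit
  def close(errorOrNull: Throwable): Unit
}

// Hypothetical key-value sink. The in-memory map stands in for a Redis
// connection purely for illustration.
class RedisKeyValueWriter extends ForeachWriterLike[(String, String)] {
  val written = scala.collection.mutable.Map[String, String]()

  override def open(partitionId: Long, version: Long): Boolean = {
    // Real code: open one Redis connection here (e.g. new Jedis(host, port)).
    true
  }

  override def process(kv: (String, String)): Unit = {
    // Real code: jedis.set(kv._1, kv._2)
    written += kv
  }

  override def close(errorOrNull: Throwable): Unit = {
    // Real code: close the Redis connection; errorOrNull is non-null on failure.
  }
}
```

In Scala this writer would then be passed to `DataStreamWriter.foreach(...)`; the point of the thread is that keeping this class in Scala keeps each row inside the JVM, avoiding the per-row JVM/PVM serialization cost a pure-Python writer would pay.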