Also, in your example, doesn't the temp view need to be accessed using the same SparkSession on the Scala side? Since I am not using a notebook, how can I get access to the same SparkSession in Scala?
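[A note for readers of the archive: the sharing that Burak's temp-view suggestion relies on is the getOrCreate pattern, where repeated builder calls inside one Spark application return the same session rather than new ones. The sketch below is a plain-Python stand-in (no Spark required) for that pattern; the `ToySession` class and its `temp_views` dict are hypothetical illustrations, not Spark APIs.]

```python
# Toy illustration of the getOrCreate pattern: repeated calls return the
# same shared session object, so a temp view registered through one handle
# is visible through every other handle in the same application.
# ToySession is a hypothetical stand-in, not a Spark class.

class ToySession:
    _active = None  # application-wide singleton, like the active SparkSession

    def __init__(self):
        self.temp_views = {}  # stand-in for the session's view catalog

    @classmethod
    def get_or_create(cls):
        if cls._active is None:
            cls._active = cls()
        return cls._active

# The "Python side" registers a view...
py_side = ToySession.get_or_create()
py_side.temp_views["tmp1"] = "streaming dataframe"

# ...and the "Scala side", calling getOrCreate in the same application,
# receives the same session and therefore sees the same view catalog.
scala_side = ToySession.get_or_create()
print(scala_side.temp_views["tmp1"])  # prints: streaming dataframe
```

The point of the sketch: outside a notebook, what matters is that both language sides run inside the same Spark application, so that each side's builder call resolves to the same underlying session and catalog.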
On Fri, Jul 28, 2017 at 3:17 PM, Priyank Shrivastava <priy...@asperasoft.com> wrote:
> Thanks Burak.
>
> In a streaming context, would I need to do any state management for the temp views? For example, across sliding windows.
>
> Priyank
>
> On Fri, Jul 28, 2017 at 3:13 PM, Burak Yavuz <brk...@gmail.com> wrote:
>> Hi Priyank,
>>
>> You may register them as temporary tables to use across language boundaries.
>>
>> Python:
>> df = spark.readStream...
>> # Python logic
>> df.createOrReplaceTempView("tmp1")
>>
>> Scala:
>> val df = spark.table("tmp1")
>> df.writeStream
>>   .foreach(...)
>>
>> On Fri, Jul 28, 2017 at 3:06 PM, Priyank Shrivastava <priy...@asperasoft.com> wrote:
>>> TD,
>>>
>>> For a hybrid Python-Scala approach, what's the recommended way of handing off a DataFrame from Python to Scala? I would especially like to know in a streaming context.
>>>
>>> I am not using notebooks/Databricks. We are running it on our own Spark 2.1 cluster.
>>>
>>> Priyank
>>>
>>> On Wed, Jul 26, 2017 at 12:49 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>>>> We see that all the time. For example, in SQL, people can write their user-defined functions in Scala/Java and use them from SQL/Python/anywhere. That is the recommended way to get the best combination of performance and ease of use from non-JVM languages.
>>>>
>>>> On Wed, Jul 26, 2017 at 11:49 AM, Priyank Shrivastava <priy...@asperasoft.com> wrote:
>>>>> Thanks TD. I am going to try the Python-Scala hybrid approach by using Scala only for the custom Redis sink and Python for the rest of the app. I understand it might not be as efficient as writing the app purely in Scala, but unfortunately I am constrained on Scala resources. Have you come across other use cases where people have resorted to such a Python-Scala hybrid approach?
>>>>> Regards,
>>>>> Priyank
>>>>>
>>>>> On Wed, Jul 26, 2017 at 1:46 AM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>>>>>> Hello Priyank,
>>>>>>
>>>>>> Writing something purely in Scala/Java would be the most efficient. Even if we expose Python APIs that allow writing custom sinks in pure Python, it won't be as efficient as a Scala/Java foreach, as the data would have to cross the JVM/PVM boundary, which has significant overheads. So a Scala/Java foreach is always going to be the best option.
>>>>>>
>>>>>> TD
>>>>>>
>>>>>> On Tue, Jul 25, 2017 at 6:05 PM, Priyank Shrivastava <priy...@asperasoft.com> wrote:
>>>>>>> I am trying to write key-values to Redis using a DataStreamWriter object with the PySpark Structured Streaming APIs. I am using Spark 2.2.
>>>>>>>
>>>>>>> Since the foreach sink is not supported for Python (see <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach>), I am trying to find alternatives.
>>>>>>>
>>>>>>> One alternative is to write a separate Scala module only to push data into Redis using foreach; ForeachWriter <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.ForeachWriter> is supported in Scala. BUT this doesn't seem like an efficient approach, and it adds deployment overhead because now I will have to support Scala in my app.
>>>>>>>
>>>>>>> Another approach is obviously to use Scala instead of Python, which is fine, but I want to make sure that I absolutely cannot use Python for this problem before I take that path.
>>>>>>>
>>>>>>> I would appreciate some feedback and alternative design approaches for this problem.
>>>>>>>
>>>>>>> Thanks.
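[A note for readers of the archive: the ForeachWriter contract discussed above is small (open / process / close, called by Spark per partition and epoch). The sketch below mirrors that lifecycle in plain Python against an in-memory dict standing in for a Redis client; `RedisLikeStore`, `KeyValueForeachWriter`, and the (key, value) row shape are hypothetical illustrations, not Spark or redis-py APIs.]

```python
# Hedged sketch of the ForeachWriter lifecycle for a key-value sink.
# An in-memory dict stands in for Redis so the example is self-contained.

class RedisLikeStore:
    """In-memory stand-in for a Redis connection (hypothetical)."""
    def __init__(self):
        self.data = {}

    def set(self, key, value):
        self.data[key] = value

class KeyValueForeachWriter:
    """Mirrors the open/process/close contract of Scala's ForeachWriter[Row]."""
    def __init__(self, store):
        self.store = store

    def open(self, partition_id, epoch_id):
        # Real code would open a Redis connection here. Returning True
        # tells the caller this partition/epoch should be processed.
        return True

    def process(self, row):
        # Each row is assumed to be a (key, value) pair.
        key, value = row
        self.store.set(key, value)

    def close(self, error):
        # Real code would release the connection here, whether or not
        # an error occurred.
        pass

# Simulate what the engine does for one partition of one micro-batch.
store = RedisLikeStore()
writer = KeyValueForeachWriter(store)
if writer.open(partition_id=0, epoch_id=0):
    for row in [("user:1", "alice"), ("user:2", "bob")]:
        writer.process(row)
writer.close(None)
print(store.data)  # prints: {'user:1': 'alice', 'user:2': 'bob'}
```

The design point TD makes still stands: when this contract runs in Scala, rows never cross the JVM/PVM boundary, which is why a Scala sink is the efficient choice even when the rest of the pipeline is Python.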