Thanks, Burak. In a streaming context, would I need to do any state management for the temp views? For example, across sliding windows?
Priyank

On Fri, Jul 28, 2017 at 3:13 PM, Burak Yavuz <brk...@gmail.com> wrote:

> Hi Priyank,
>
> You may register them as temporary tables to use across language
> boundaries.
>
> Python:
> df = spark.readStream...
> # Python logic
> df.createOrReplaceTempView("tmp1")
>
> Scala:
> val df = spark.table("tmp1")
> df.writeStream
>   .foreach(...)
>
> On Fri, Jul 28, 2017 at 3:06 PM, Priyank Shrivastava <priy...@asperasoft.com> wrote:
>
>> TD,
>>
>> For a hybrid Python-Scala approach, what's the recommended way of
>> handing off a dataframe from Python to Scala? I would especially like
>> to know in a streaming context.
>>
>> I am not using notebooks/Databricks. We are running it on our own
>> Spark 2.1 cluster.
>>
>> Priyank
>>
>> On Wed, Jul 26, 2017 at 12:49 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>>
>>> We see that all the time. For example, in SQL, people can write their
>>> user-defined functions in Scala/Java and use them from SQL, Python,
>>> or anywhere else. That is the recommended way to get the best
>>> combination of performance and ease of use from non-JVM languages.
>>>
>>> On Wed, Jul 26, 2017 at 11:49 AM, Priyank Shrivastava <priy...@asperasoft.com> wrote:
>>>
>>>> Thanks TD. I am going to try the Python-Scala hybrid approach, using
>>>> Scala only for the custom Redis sink and Python for the rest of the
>>>> app. I understand it might not be as efficient as writing the app
>>>> purely in Scala, but unfortunately I am constrained on Scala
>>>> resources. Have you come across other use cases where people have
>>>> resorted to such a Python-Scala hybrid approach?
>>>>
>>>> Regards,
>>>> Priyank
>>>>
>>>> On Wed, Jul 26, 2017 at 1:46 AM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>>>>
>>>>> Hello Priyank,
>>>>>
>>>>> Writing something purely in Scala/Java would be the most efficient.
>>>>> Even if we expose Python APIs that allow writing custom sinks in
>>>>> pure Python, they won't be as efficient as Scala/Java foreach,
>>>>> because the data would have to cross the JVM/PVM boundary, which
>>>>> has significant overheads. So Scala/Java foreach is always going to
>>>>> be the best option.
>>>>>
>>>>> TD
>>>>>
>>>>> On Tue, Jul 25, 2017 at 6:05 PM, Priyank Shrivastava <priy...@asperasoft.com> wrote:
>>>>>
>>>>>> I am trying to write key-values to Redis through a DataStreamWriter
>>>>>> using the PySpark Structured Streaming APIs. I am using Spark 2.2.
>>>>>>
>>>>>> Since the foreach sink is not supported for Python (see
>>>>>> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach),
>>>>>> I am trying to find some alternatives.
>>>>>>
>>>>>> One alternative is to write a separate Scala module only to push
>>>>>> data into Redis using foreach; ForeachWriter
>>>>>> <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.ForeachWriter>
>>>>>> is supported in Scala. BUT this doesn't seem like an efficient
>>>>>> approach, and it adds deployment overhead because now I will have
>>>>>> to support Scala in my app.
>>>>>>
>>>>>> Another approach is obviously to use Scala instead of Python, which
>>>>>> is fine, but I want to make sure that I absolutely cannot use
>>>>>> Python for this problem before I take that path.
>>>>>>
>>>>>> Would appreciate some feedback and alternative design approaches
>>>>>> for this problem.
>>>>>>
>>>>>> Thanks.
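
For concreteness, here is a minimal sketch of the Scala half of the hand-off Burak describes. It is a sketch under stated assumptions, not a definitive implementation: it assumes the "tmp1" view registered on the Python side, the Jedis client on the classpath, string columns named "key" and "value", and that both languages share the same SparkSession (temp views are session-scoped). The names RedisForeachWriter and RedisSinkApp are illustrative, not part of any Spark API.

import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
import redis.clients.jedis.Jedis

// Illustrative writer: assumes each row carries string columns "key" and
// "value". The connection is opened in open() rather than the constructor,
// because the writer is serialized to the executors and a live socket
// cannot be shipped from the driver.
class RedisForeachWriter(host: String, port: Int) extends ForeachWriter[Row] {
  @transient private var jedis: Jedis = _

  override def open(partitionId: Long, version: Long): Boolean = {
    jedis = new Jedis(host, port)
    true // returning true means "process this partition"
  }

  override def process(row: Row): Unit = {
    jedis.set(row.getAs[String]("key"), row.getAs[String]("value"))
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (jedis != null) jedis.close()
  }
}

object RedisSinkApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("redis-sink").getOrCreate()

    // Pick up the view the Python side registered via
    // df.createOrReplaceTempView("tmp1"); only visible from the same session.
    val df = spark.table("tmp1")

    val query = df.writeStream
      .foreach(new RedisForeachWriter("localhost", 6379))
      .start()
    query.awaitTermination()
  }
}

The per-partition open/process/close lifecycle is also why foreach is the recommended place for non-serializable client state like a Redis connection.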