Hi TD,

I thought Structured Streaming provides a concept similar to DataFrames, where it does not matter which language I use to invoke the APIs, with the exception of UDFs.
So when I think of supporting a foreach sink in Python, I think of it as just a wrapper API, with the data remaining in the JVM only, similar to, for example, a Hive writer or HDFS writer in the DataFrame API. Am I oversimplifying? Or is it just early days for Structured Streaming? Happy to learn of any mistakes in my thinking and understanding.

Best
Ayan

On Thu, 27 Jul 2017 at 4:49 am, Priyank Shrivastava <priy...@asperasoft.com> wrote:

> Thanks TD. I am going to try the Python-Scala hybrid approach by using
> Scala only for the custom Redis sink and Python for the rest of the app. I
> understand it might not be as efficient as writing the app purely in Scala,
> but unfortunately I am constrained on Scala resources. Have you come
> across other use cases where people have resorted to such a Python-Scala
> hybrid approach?
>
> Regards,
> Priyank
>
> On Wed, Jul 26, 2017 at 1:46 AM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>
>> Hello Priyank,
>>
>> Writing something purely in Scala/Java would be the most efficient. Even
>> if we expose Python APIs that allow writing custom sinks in pure Python, it
>> won't be as efficient as the Scala/Java foreach, as the data would have to
>> cross the JVM/PVM boundary, which has significant overheads. So Scala/Java
>> foreach is always going to be the best option.
>>
>> TD
>>
>> On Tue, Jul 25, 2017 at 6:05 PM, Priyank Shrivastava <priy...@asperasoft.com> wrote:
>>
>>> I am trying to write key-values to Redis using a DataStreamWriter object
>>> via the PySpark Structured Streaming APIs. I am using Spark 2.2.
>>>
>>> Since the foreach sink is not supported for Python (see here
>>> <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach>),
>>> I am trying to find some alternatives.
>>>
>>> One alternative is to write a separate Scala module only to push data
>>> into Redis using foreach; ForeachWriter
>>> <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.ForeachWriter>
>>> is supported in Scala. But this doesn't seem like an efficient approach,
>>> and it adds deployment overhead because now I will have to support Scala
>>> in my app.
>>>
>>> Another approach is obviously to use Scala instead of Python, which is
>>> fine, but I want to make sure that I absolutely cannot use Python for this
>>> problem before I take that path.
>>>
>>> Would appreciate some feedback and alternative design approaches for
>>> this problem.
>>>
>>> Thanks.

--
Best Regards,
Ayan Guha
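
[Editor's sketch] Since the thread hinges on Scala's ForeachWriter and its open/process/close lifecycle, here is a minimal sketch of what the Scala side of such a Redis sink could look like. To keep the sketch self-contained, `ForeachWriterLike` is a locally defined stand-in with the same lifecycle as `org.apache.spark.sql.ForeachWriter`, and `RedisKeyValueWriter` with its in-memory buffer is hypothetical; a real implementation would extend Spark's actual trait and open a Redis client connection (e.g. Jedis) in `open()`.

```scala
// Stand-in for org.apache.spark.sql.ForeachWriter[T] so this sketch
// compiles without a Spark dependency; the real trait has the same
// open/process/close lifecycle, invoked once per partition per epoch.
abstract class ForeachWriterLike[T] {
  def open(partitionId: Long, version: Long): Boolean
  def process(value: T): Unit
  def close(errorOrNull: Throwable): Unit
}

// Hypothetical key-value sink. The in-memory map stands in for a Redis
// connection purely for illustration.
class RedisKeyValueWriter extends ForeachWriterLike[(String, String)] {
  val written = scala.collection.mutable.Map[String, String]()

  override def open(partitionId: Long, version: Long): Boolean = {
    // Real code: open one Redis connection here (e.g. new Jedis(host, port)).
    true
  }

  override def process(kv: (String, String)): Unit = {
    // Real code: jedis.set(kv._1, kv._2)
    written += kv
  }

  override def close(errorOrNull: Throwable): Unit = {
    // Real code: close the Redis connection; errorOrNull is non-null on failure.
  }
}
```

In Scala this writer would then be passed to `DataStreamWriter.foreach(...)`; the point of the thread is that keeping this class in Scala keeps each row inside the JVM, avoiding the per-row JVM/PVM serialization cost a pure-Python writer would pay.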