TD,

For a hybrid Python-Scala approach, what's the recommended way of handing off a DataFrame from Python to Scala? I would especially like to know in a streaming context.

I am not using notebooks/Databricks. We are running it on our own Spark 2.1 cluster.

Priyank

On Wed, Jul 26, 2017 at 12:49 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:

> We see that all the time. For example, in SQL, people can write their user-defined function in Scala/Java and use it from SQL/Python/anywhere. That is the recommended way to get the best combination of performance and ease-of-use from non-JVM languages.
>
> On Wed, Jul 26, 2017 at 11:49 AM, Priyank Shrivastava <priy...@asperasoft.com> wrote:
>
>> Thanks TD. I am going to try the Python-Scala hybrid approach by using Scala only for the custom Redis sink and Python for the rest of the app. I understand it might not be as efficient as writing the app purely in Scala, but unfortunately I am constrained on Scala resources. Have you come across other use cases where people have resorted to such a Python-Scala hybrid approach?
>>
>> Regards,
>> Priyank
>>
>> On Wed, Jul 26, 2017 at 1:46 AM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>>
>>> Hello Priyank,
>>>
>>> Writing something purely in Scala/Java would be the most efficient. Even if we exposed Python APIs that allowed writing custom sinks in pure Python, it won't be as efficient as a Scala/Java foreach, because the data would have to cross the JVM/PVM boundary, which has significant overheads. So a Scala/Java foreach is always going to be the best option.
>>>
>>> TD
>>>
>>> On Tue, Jul 25, 2017 at 6:05 PM, Priyank Shrivastava <priy...@asperasoft.com> wrote:
>>>
>>>> I am trying to write key-values to Redis using a DataStreamWriter object with the PySpark Structured Streaming APIs. I am using Spark 2.2.
>>>>
>>>> Since the foreach sink is not supported for Python (see <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach>), I am trying to find some alternatives.
>>>>
>>>> One alternative is to write a separate Scala module only to push data into Redis using foreach; ForeachWriter <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.ForeachWriter> is supported in Scala. But this doesn't seem like an efficient approach, and it adds deployment overhead because now I will have to support Scala in my app.
>>>>
>>>> Another approach is obviously to use Scala instead of Python, which is fine, but I want to make sure that I absolutely cannot use Python for this problem before I take that path.
>>>>
>>>> Would appreciate some feedback and alternative design approaches for this problem.
>>>>
>>>> Thanks.
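
[Editor's sketch] The Scala-only Redis sink discussed in this thread would be built around the `ForeachWriter` API that TD references. The following is a minimal sketch, not code from the thread: it assumes the Jedis Redis client is on the classpath, and the class name, constructor parameters, and the "key"/"value" column names are all illustrative.

```scala
import org.apache.spark.sql.{ForeachWriter, Row}
import redis.clients.jedis.Jedis  // assumed Redis client dependency

// A minimal ForeachWriter (Spark 2.x API) that writes key-value rows to Redis.
class RedisForeachWriter(host: String, port: Int) extends ForeachWriter[Row] {

  // Marked @transient: ForeachWriter instances are serialized to executors,
  // so the connection must be created in open(), not on the driver.
  @transient private var jedis: Jedis = _

  override def open(partitionId: Long, version: Long): Boolean = {
    jedis = new Jedis(host, port)
    true // returning true means process() will be called for this partition
  }

  override def process(row: Row): Unit = {
    // Assumes the streaming DataFrame has string columns named "key" and "value".
    jedis.set(row.getAs[String]("key"), row.getAs[String]("value"))
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (jedis != null) jedis.close()
  }
}

// Usage from the Scala module, given a streaming DataFrame `df`:
//   df.writeStream
//     .foreach(new RedisForeachWriter("localhost", 6379))
//     .start()
```

For the Python-to-Scala handoff itself, one common pattern (hedged, since the thread does not settle on one) is to package a class like the above in a jar, submit it with `--jars`, and have both sides share state through the same SparkSession rather than passing DataFrame objects across the language boundary directly.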