Re: States get dropped in Structured Streaming

2020-10-23 Thread Jungtaek Lim
Unfortunately the information you've provided doesn't give any hint as to
whether rows in the state are being evicted correctly on watermark advance, or
whether there's an unknown bug where some of the rows in state are silently
dropped. I haven't heard of a case of the latter - you'd probably want to
double-check the former first, focusing on how the watermark advances. If it
does turn out to be the latter, you'll probably need to dig into the Spark code
and inject debug logging.
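
If it helps, here is a minimal PySpark sketch of how you could watch the
reported watermark and the state row counts after each micro-batch (the
built-in rate source and the window sizes are placeholders for your own source
and aggregation). If numRowsTotal only drops once the watermark has passed the
corresponding windows, eviction is working as designed; if rows vanish while
the watermark is still far behind, that would point at the second case.

import time
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[2]").appName("watermark-debug").getOrCreate()

# The rate source stands in for the real input; it produces `timestamp` and `value`.
events = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
          .withColumnRenamed("timestamp", "event_time"))

# Stateful windowed aggregation: state for a window is kept until the
# watermark passes the end of that window.
counts = (events
          .withWatermark("event_time", "10 minutes")
          .groupBy(F.window("event_time", "5 minutes"))
          .count())

query = counts.writeStream.outputMode("update").format("console").start()

# Poll the streaming progress and print the watermark alongside the state size.
for _ in range(10):
    time.sleep(30)
    progress = query.lastProgress
    if progress:
        watermark = progress["eventTime"].get("watermark")
        state_rows = [op["numRowsTotal"] for op in progress["stateOperators"]]
        print("watermark=%s, state rows per operator=%s" % (watermark, state_rows))

query.stop()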

On Fri, Oct 23, 2020 at 3:12 PM Eric Beabes 
wrote:

> We're using Stateful Structured Streaming in Spark 2.4. We are noticing
> that when the load on the system is heavy and lots of messages are coming
> in, some of the states disappear with no error message. Any suggestions on
> how we can debug this? Any tips for fixing this?
>
> Thanks in advance.
>


Re: Scala vs Python for ETL with Spark

2020-10-23 Thread Sofia’s World
Hey,
My 2 cents on CI/CD for PySpark: you can leverage pytest plus Holden Karau's
spark-testing-base library for CI, giving you `almost` the same functionality
as Scala - I say almost because in Scala you have nice, descriptive FunSpecs.
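
A minimal sketch of what such a test can look like, using just pytest and a
local SparkSession (spark-testing-base adds convenience helpers on top of
this; `add_greeting` is only a stand-in for your own transformation):

import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def add_greeting(df):
    # Example transformation under test: append a greeting column.
    return df.withColumn("greeting", F.concat(F.lit("hello "), F.col("name")))


@pytest.fixture(scope="session")
def spark():
    # One local SparkSession shared by the whole test session.
    session = (SparkSession.builder
               .master("local[2]")
               .appName("ci-tests")
               .getOrCreate())
    yield session
    session.stop()


def test_add_greeting(spark):
    df = spark.createDataFrame([("alice",), ("bob",)], ["name"])
    result = {row["greeting"] for row in add_greeting(df).collect()}
    assert result == {"hello alice", "hello bob"}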

For me the choice is based on expertise. Having worked with teams which are
99% Python, the cost of retraining - or even hiring - is too big, especially
if you have an existing project and aggressive deadlines.
Please feel free to object.
Kind Regards

On Fri, Oct 23, 2020, 1:01 PM William R  wrote:

> It's really a very big discussion around PySpark vs. Scala. I have a little
> experience with how we can automate CI/CD when it's a JVM-based language.
> I would like to take this as an opportunity to understand the end-to-end
> CI/CD flow for PySpark-based ETL pipelines.
>
> Could someone please list the steps for how pipeline automation works
> when it comes to PySpark-based pipelines in production?
>
> //William
>
> On Fri, Oct 23, 2020 at 11:24 AM Wim Van Leuven <
> wim.vanleu...@highestpoint.biz> wrote:
>
>> I think Sean is right, but in your argumentation you mention that
>> 'functionality is sacrificed in favour of the availability of resources'.
>> That's where I disagree with you and agree with Sean: that is mostly not true.
>>
>> In your previous posts you also mentioned this. The only reason we
>> sometimes have to bail out to Scala is performance with certain UDFs.
>>
>> On Thu, 22 Oct 2020 at 23:11, Mich Talebzadeh 
>> wrote:
>>
>>> Thanks for the feedback Sean.
>>>
>>> Kind regards,
>>>
>>> Mich
>>>
>>>
>>>
>>> LinkedIn:
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Thu, 22 Oct 2020 at 20:34, Sean Owen  wrote:
>>>
 I don't find this trolling; I agree with the observation that 'the
 skills you have' are a valid and important determiner of what tools you
 pick.
 I disagree that you just have to pick the optimal tool for everything.
 Sounds good until that comes in contact with the real world.
 For Spark, Python vs Scala just doesn't matter a lot, especially if
 you're doing DataFrame operations. By design. So I can't see there being
 one answer to this.

 On Thu, Oct 22, 2020 at 2:23 PM Gourav Sengupta <
 gourav.sengu...@gmail.com> wrote:

> Hi Mich,
>
> this is turning into a troll now, can you please stop this?
>
> No one uses Scala where Python should be used, and no one uses Python
> where Scala should be used - it all depends on requirements. Everyone
> understands polyglot programming and how to use relevant technologies best
> to their advantage.
>
>
> Regards,
> Gourav Sengupta
>
>
>>>
>
> --
> Regards,
> William R
> +919037075164
>
>
>


Re: Scala vs Python for ETL with Spark

2020-10-23 Thread Mich Talebzadeh
Hi Wim,


I think we are splitting the atom here, but my point about functionality was
based on the following:



   1. Spark is written in Scala, so knowing the Scala programming language
   helps coders navigate the source code if something does not function as
   expected.
   2. Using Python on top of the framework increases the probability of
   issues and bugs, because translation between these two different languages
   is difficult.
   3. Using Scala for Spark provides access to the latest features of the
   Spark framework, as they are first available in Scala and then ported to
   Python.
   4. Some functionality is not available in Python. I have seen this a few
   times in the Spark documentation.

There is an interesting write-up on this, although it does not touch on the
CI/CD aspects:


 Developing Apache Spark Applications: Scala vs. Python



Regards,


Mich



On Fri, 23 Oct 2020 at 10:23, Wim Van Leuven 
wrote:

> I think Sean is right, but in your argumentation you mention that
> 'functionality is sacrificed in favour of the availability of resources'.
> That's where I disagree with you and agree with Sean: that is mostly not true.
>
> In your previous posts you also mentioned this. The only reason we
> sometimes have to bail out to Scala is performance with certain UDFs.
>
> On Thu, 22 Oct 2020 at 23:11, Mich Talebzadeh 
> wrote:
>
>> Thanks for the feedback Sean.
>>
>> Kind regards,
>>
>> Mich
>>
>>
>>
>> LinkedIn:
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 22 Oct 2020 at 20:34, Sean Owen  wrote:
>>
>>> I don't find this trolling; I agree with the observation that 'the
>>> skills you have' are a valid and important determiner of what tools you
>>> pick.
>>> I disagree that you just have to pick the optimal tool for everything.
>>> Sounds good until that comes in contact with the real world.
>>> For Spark, Python vs Scala just doesn't matter a lot, especially if
>>> you're doing DataFrame operations. By design. So I can't see there being
>>> one answer to this.
>>>
>>> On Thu, Oct 22, 2020 at 2:23 PM Gourav Sengupta <
>>> gourav.sengu...@gmail.com> wrote:
>>>
 Hi Mich,

 this is turning into a troll now, can you please stop this?

 No one uses Scala where Python should be used, and no one uses Python
 where Scala should be used - it all depends on requirements. Everyone
 understands polyglot programming and how to use relevant technologies best
 to their advantage.


 Regards,
 Gourav Sengupta


>>


Need help on Calling Pyspark code using Wheel

2020-10-23 Thread Sachit Murarka
Hi Users,

I have created a wheel file using Poetry. I tried running the following
commands to run the Spark job using the wheel, but it is not working. Can
anyone please let me know the correct invocation step for the wheel file?

spark-submit --py-files /path/to/wheel
spark-submit --files /path/to/wheel
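
For context, a rough sketch of the layout I am aiming for (the package name
myjob, the wheel file name and the main.py entry script are placeholders):
the wheel holds the package code, and since a pure-Python wheel is just a zip
archive it should be importable when shipped via --py-files, while a small
driver script is passed to spark-submit as the primary application file:

# main.py - driver script submitted as the primary application file
from pyspark.sql import SparkSession
from myjob.pipeline import run   # package built into the wheel by Poetry

if __name__ == "__main__":
    spark = SparkSession.builder.appName("wheel-job").getOrCreate()
    run(spark)
    spark.stop()

and then something like:

spark-submit --py-files /path/to/myjob-0.1.0-py3-none-any.whl main.py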

Thanks
Sachit

Kind Regards,
Sachit Murarka


Re: Scala vs Python for ETL with Spark

2020-10-23 Thread William R
It's really a very big discussion around PySpark vs. Scala. I have a little
experience with how we can automate CI/CD when it's a JVM-based language.
I would like to take this as an opportunity to understand the end-to-end
CI/CD flow for PySpark-based ETL pipelines.

Could someone please list the steps for how pipeline automation works
when it comes to PySpark-based pipelines in production?

//William

On Fri, Oct 23, 2020 at 11:24 AM Wim Van Leuven <
wim.vanleu...@highestpoint.biz> wrote:

> I think Sean is right, but in your argumentation you mention that
> 'functionality is sacrificed in favour of the availability of resources'.
> That's where I disagree with you and agree with Sean: that is mostly not true.
>
> In your previous posts you also mentioned this. The only reason we
> sometimes have to bail out to Scala is performance with certain UDFs.
>
> On Thu, 22 Oct 2020 at 23:11, Mich Talebzadeh 
> wrote:
>
>> Thanks for the feedback Sean.
>>
>> Kind regards,
>>
>> Mich
>>
>>
>>
>> LinkedIn:
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 22 Oct 2020 at 20:34, Sean Owen  wrote:
>>
>>> I don't find this trolling; I agree with the observation that 'the
>>> skills you have' are a valid and important determiner of what tools you
>>> pick.
>>> I disagree that you just have to pick the optimal tool for everything.
>>> Sounds good until that comes in contact with the real world.
>>> For Spark, Python vs Scala just doesn't matter a lot, especially if
>>> you're doing DataFrame operations. By design. So I can't see there being
>>> one answer to this.
>>>
>>> On Thu, Oct 22, 2020 at 2:23 PM Gourav Sengupta <
>>> gourav.sengu...@gmail.com> wrote:
>>>
 Hi Mich,

 this is turning into a troll now, can you please stop this?

 No one uses Scala where Python should be used, and no one uses Python
 where Scala should be used - it all depends on requirements. Everyone
 understands polyglot programming and how to use relevant technologies best
 to their advantage.


 Regards,
 Gourav Sengupta


>>

-- 
Regards,
William R
+919037075164


Re: Scala vs Python for ETL with Spark

2020-10-23 Thread Wim Van Leuven
I think Sean is right, but in your argumentation you mention that
'functionality is sacrificed in favour of the availability of resources'.
That's where I disagree with you and agree with Sean: that is mostly not true.

In your previous posts you also mentioned this. The only reason we
sometimes have to bail out to Scala is performance with certain UDFs.
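
For illustration, a rough sketch of the kind of UDF we mean (it needs pyarrow
installed, and the column and function names are made up): a plain
row-at-a-time Python UDF pays serialization cost per value, a vectorized
pandas UDF processes whole Arrow batches and helps in many cases, and when
even that is not fast enough we switch to a Scala UDF.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.master("local[2]").appName("udf-demo").getOrCreate()
df = spark.range(1000000).withColumn("x", F.rand())

# Row-at-a-time Python UDF: every value is pickled across the JVM/Python boundary.
slow_double = F.udf(lambda v: v * 2.0, DoubleType())

# Vectorized pandas UDF: values arrive and are returned as pandas Series.
@pandas_udf(DoubleType(), PandasUDFType.SCALAR)
def fast_double(v):
    return v * 2.0

df.select(slow_double("x").alias("slow"), fast_double("x").alias("fast")).show(5)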

On Thu, 22 Oct 2020 at 23:11, Mich Talebzadeh 
wrote:

> Thanks for the feedback Sean.
>
> Kind regards,
>
> Mich
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 22 Oct 2020 at 20:34, Sean Owen  wrote:
>
>> I don't find this trolling; I agree with the observation that 'the skills
>> you have' are a valid and important determiner of what tools you pick.
>> I disagree that you just have to pick the optimal tool for everything.
>> Sounds good until that comes in contact with the real world.
>> For Spark, Python vs Scala just doesn't matter a lot, especially if
>> you're doing DataFrame operations. By design. So I can't see there being
>> one answer to this.
>>
>> On Thu, Oct 22, 2020 at 2:23 PM Gourav Sengupta <
>> gourav.sengu...@gmail.com> wrote:
>>
>>> Hi Mich,
>>>
>>> this is turning into a troll now, can you please stop this?
>>>
>>> No one uses Scala where Python should be used, and no one uses Python
>>> where Scala should be used - it all depends on requirements. Everyone
>>> understands polyglot programming and how to use relevant technologies best
>>> to their advantage.
>>>
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>>
>