Hey, my 2 cents on CI/CD for PySpark. You can leverage pytest plus Holden Karau's spark-testing-base library for CI, giving you `almost` the same functionality as Scala - I say almost because in Scala you have nice, descriptive FunSpecs.
For me the choice is based on expertise. Having worked with teams that are 99% Python, the cost of retraining - or even of hiring - is too big, especially if you have an existing project and aggressive deadlines. Please feel free to object.

Kind regards

On Fri, Oct 23, 2020, 1:01 PM William R <rspwill...@gmail.com> wrote:

> It's really a very big discussion around PySpark vs Scala. I have a
> little bit of experience with how CI/CD can be automated when it's a
> JVM-based language.
> I would like to take this as an opportunity to understand the
> end-to-end CI/CD flow for PySpark-based ETL pipelines.
>
> Could someone please list the steps of how pipeline automation works
> for PySpark-based pipelines in production?
>
> //William
>
> On Fri, Oct 23, 2020 at 11:24 AM Wim Van Leuven <wim.vanleu...@highestpoint.biz> wrote:
>
>> I think Sean is right, but in your argumentation you mention that
>> 'functionality is sacrificed in favour of the availability of
>> resources'. That's where I disagree with you but agree with Sean. That
>> is mostly not true.
>>
>> In your previous posts you also mentioned this. The only reason we
>> sometimes have to bail out to Scala is for performance with certain
>> UDFs.
>>
>> On Thu, 22 Oct 2020 at 23:11, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Thanks for the feedback Sean.
>>>
>>> Kind regards,
>>>
>>> Mich
>>>
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>> On Thu, 22 Oct 2020 at 20:34, Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> I don't find this trolling; I agree with the observation that 'the
>>>> skills you have' are a valid and important determiner of what tools
>>>> you pick.
>>>> I disagree that you just have to pick the optimal tool for
>>>> everything. That sounds good until it comes into contact with the
>>>> real world.
>>>> For Spark, Python vs Scala just doesn't matter a lot, especially if
>>>> you're doing DataFrame operations. By design. So I can't see there
>>>> being one answer to this.
>>>>
>>>> On Thu, Oct 22, 2020 at 2:23 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>>
>>>>> Hi Mich,
>>>>>
>>>>> This is turning into a troll now; can you please stop this?
>>>>>
>>>>> No one uses Scala where Python should be used, and no one uses
>>>>> Python where Scala should be used - it all depends on requirements.
>>>>> Everyone understands polyglot programming and how to use relevant
>>>>> technologies best to their advantage.
>>>>>
>>>>> Regards,
>>>>> Gourav Sengupta
>
> --
> Regards,
> William R
> +919037075164