Thanks Mich. That looks good.

On Sun, Jan 2, 2022 at 7:10 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> LOL.
>
> You asking these questions takes me back to summer 2016, when I started
> writing notes on Spark. They cover earlier versions, but the notions of RDD,
> local, standalone, YARN etc. are still valid. In those days there was no k8s
> and the public cloud was not widely adopted. I browsed it and it was
> refreshing for me. Anyway, you may find some points addressing the kind of
> questions you tend to ask.
>
> HTH
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Sun, 2 Jan 2022 at 00:20, Bitfox <bit...@bitfox.top> wrote:
>
>> One more question: for this big filter, given that my server has 4 cores,
>> will Spark (standalone mode) split the RDD into 4 partitions automatically?
>>
>> Thanks
>>
>> On Sun, Jan 2, 2022 at 6:30 AM Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> Create a list of the values that you don't want and filter on those:
>>>
>>>   DF = spark.range(10)
>>>   DF
>>>   DataFrame[id: bigint]
>>>
>>>   array = [1, 2, 3, 8]  # don't want these
>>>
>>>   DF.filter(DF.id.isin(array) == False).show()
>>>   +---+
>>>   | id|
>>>   +---+
>>>   |  0|
>>>   |  4|
>>>   |  5|
>>>   |  6|
>>>   |  7|
>>>   |  9|
>>>   +---+
>>>
>>> or use the unary NOT operator (~):
>>>
>>>   DF.filter(~DF.id.isin(array)).show()
>>>   +---+
>>>   | id|
>>>   +---+
>>>   |  0|
>>>   |  4|
>>>   |  5|
>>>   |  6|
>>>   |  7|
>>>   |  9|
>>>   +---+
>>>
>>> HTH
>>>
>>> On Sat, 1 Jan 2022 at 20:59, Bitfox <bit...@bitfox.top> wrote:
>>>
>>>> Using the DataFrame API, I need to implement a batch filter:
>>>>
>>>>   DF.select(..).where((col(..) != 'a') & (col(..) != 'b') & …)
>>>>
>>>> There are a lot of keywords that should be filtered out on the same
>>>> column in the where clause.
>>>>
>>>> How can I make this smarter? A UDF or something else?
>>>>
>>>> Thanks & Happy New Year!
>>>> Bitfox