One more question about this big filter: given my server has 4 cores, will Spark (standalone mode) split the RDD into 4 partitions automatically?
Thanks

On Sun, Jan 2, 2022 at 6:30 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Create a list of values that you don't want and filter on those
>
> >>> DF = spark.range(10)
> >>> DF
> DataFrame[id: bigint]
> >>>
> >>> array = [1, 2, 3, 8]  # don't want these
> >>> DF.filter(DF.id.isin(array) == False).show()
> +---+
> | id|
> +---+
> |  0|
> |  4|
> |  5|
> |  6|
> |  7|
> |  9|
> +---+
>
> or use the binary NOT operator:
>
> >>> DF.filter(~DF.id.isin(array)).show()
> +---+
> | id|
> +---+
> |  0|
> |  4|
> |  5|
> |  6|
> |  7|
> |  9|
> +---+
>
> HTH
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Sat, 1 Jan 2022 at 20:59, Bitfox <bit...@bitfox.top> wrote:
>
>> Using the dataframe API I need to implement a batch filter:
>>
>> DF.select(..).where(col(..) != 'a' and col(..) != 'b' and ...)
>>
>> There are a lot of keywords that should be filtered for the same column
>> in the where statement.
>>
>> How can I make it smarter? UDF or others?
>>
>> Thanks & Happy New Year!
>> Bitfox
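For what it's worth, the `~DF.id.isin(array)` pattern in Mich's reply is logically a single set-membership test, which is why it replaces a long chain of `col(..) != value` conditions. A plain-Python sketch of the same semantics (not Spark code, just the idea, using made-up variable names):

```python
# Instead of chaining many `!= value` comparisons, collect the unwanted
# values in one container and test membership once per row. This mirrors
# DF.filter(~DF.id.isin(array)) from the reply above.
unwanted = {1, 2, 3, 8}        # values to exclude (same as `array` above)

rows = list(range(10))         # stands in for the id column of spark.range(10)

# Keep only rows whose value is NOT in the unwanted set:
kept = [r for r in rows if r not in unwanted]

print(kept)  # -> [0, 4, 5, 6, 7, 9]
```

In Spark the same benefit applies: one `isin` call over a list (or a broadcast of a large list) is easier to maintain than dozens of hand-written `!=` clauses, and the optimizer can treat it as a single predicate.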