Thanks Mich. That looks good.

On Sun, Jan 2, 2022 at 7:10 PM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> LOL.
>
> Your questions take me back to summer 2016, when I started writing notes
> on Spark. They cover earlier versions, obviously, but the notions of RDD,
> local, standalone, YARN etc. are still valid. In those days there was no
> k8s and the public cloud was not widely adopted. I browsed the notes again
> and it was refreshing for me. Anyway, you may find some points in them
> that address the kind of questions you tend to ask.
>
> HTH
>
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 2 Jan 2022 at 00:20, Bitfox <bit...@bitfox.top> wrote:
>
>> One more question: for this big filter, given that my server has 4 cores,
>> will Spark (standalone mode) split the RDD into 4 partitions automatically?
>>
>> Thanks
>>
>> On Sun, Jan 2, 2022 at 6:30 AM Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> Create a list of the values that you don't want and filter on those:
>>>
>>> >>> DF = spark.range(10)
>>> >>> DF
>>> DataFrame[id: bigint]
>>> >>>
>>> >>> array = [1, 2, 3, 8]  # don't want these
>>> >>> DF.filter(DF.id.isin(array) == False).show()
>>> +---+
>>> | id|
>>> +---+
>>> |  0|
>>> |  4|
>>> |  5|
>>> |  6|
>>> |  7|
>>> |  9|
>>> +---+
>>>
>>>  or use the bitwise NOT operator (~):
>>>
>>> >>> DF.filter(~DF.id.isin(array)).show()
>>> +---+
>>> | id|
>>> +---+
>>> |  0|
>>> |  4|
>>> |  5|
>>> |  6|
>>> |  7|
>>> |  9|
>>> +---+
>>>
>>>
>>> HTH
>>>
>>>
>>>
>>> On Sat, 1 Jan 2022 at 20:59, Bitfox <bit...@bitfox.top> wrote:
>>>
>>>> Using the DataFrame API, I need to implement a batch filter:
>>>>
>>>> DF.select(..).where((col(..) != 'a') & (col(..) != 'b') & …)
>>>>
>>>> There are a lot of keywords that should be filtered out for the same
>>>> column in the where clause.
>>>>
>>>> How can I make it smarter? With a UDF or something else?
>>>>
>>>> Thanks & Happy New Year!
>>>> Bitfox
>>>>
>>>
