I always use the DataFrame API, though I am fairly familiar with general SQL. I used the method you provided to create a big filter, as described here:
https://bitfoxtop.wordpress.com/2022/01/02/filter-out-stopwords-in-spark/

Thanks
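The "big filter" referenced above is essentially the isin()-against-a-list pattern that Mich shows further down the thread, applied to a column of words. A minimal sketch of that pattern, assuming an active spark session; the sample rows and the stopword list here are made up for illustration and are not taken from the linked post:

========================================================================
from pyspark.sql.functions import col

# Hypothetical data: one word per row.
words = spark.createDataFrame([("the",), ("spark",), ("and",), ("dataframe",)], ["word"])

stop_words = ["the", "and", "of"]   # assumed stopword list

# Keep only the rows whose word is NOT in the stopword list.
words.filter(~col("word").isin(stop_words)).show()
========================================================================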
On Sun, Jan 2, 2022 at 9:06 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Well, the short answer is that there is no such thing as one being more performant; your mileage varies.

SQL is a domain-specific language used in programming and designed for managing data held in a relational database management system, or for stream processing in a relational data stream management system.

A DataFrame is a Dataset organised into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Apache Hive, Google BigQuery, other external databases, or existing RDDs.

You use the SQL API to interact with the underlying data, which is read in by constructing a DataFrame on it.

The way I use it is either

from pyspark.sql.functions import col

DF = spark.table("alayer.joint_accounts_view")
DF.select(col('transactiondate'), col('transactiontype')).orderBy(col("transactiondate")).show(5)

OR

DF.createOrReplaceTempView("tmp")  ## create a temporary view
spark.sql("select transactiondate, transactiontype from tmp order by transactiondate").show(5)

Use whichever you choose. Under the hood these APIs share a common layer, so for me as a practitioner the question of which one is more performant does not come into it.

HTH
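A quick way to see that "common layer" point for yourself is to compare the physical plans the two styles produce. A minimal sketch, assuming an active spark session and reusing the table and column names from the example above; both calls should print essentially the same plan, because both styles go through the same Catalyst optimizer:

========================================================================
from pyspark.sql.functions import col

DF = spark.table("alayer.joint_accounts_view")   # table name taken from the example above
DF.createOrReplaceTempView("tmp")

# Physical plan for the DataFrame-API version
DF.select(col("transactiondate"), col("transactiontype")).orderBy(col("transactiondate")).explain()

# Physical plan for the SQL version
spark.sql("select transactiondate, transactiontype from tmp order by transactiondate").explain()
========================================================================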
On Sun, 2 Jan 2022 at 12:11, Bitfox <bit...@bitfox.top> wrote:

May I ask, between the DataFrame API and the SQL API, which is better on performance? Thanks

On Sun, Jan 2, 2022 at 8:06 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

Hi Mich,

your notes are really great, they really brought back the old days again :) thanks.

Just to note a few points that I found useful related to this question:
1. cores and threads - page 5
2. executor cores and number settings - page 6

I think that the following example may be of use. Note that I have one driver with 8 cores, as I am running PySpark 3.1.2 in local mode, but this should give a way to find out a bit more:

========================================================================
import pyspark
from pyspark.sql.types import *

# create the filter dataframe (there are easier ways to do the below)
spark.createDataFrame(
    list(map(lambda filter: pyspark.sql.Row(filter), [0, 1, 2, 4, 7, 9])),
    StructType([StructField("filter_value", IntegerType())])
).createOrReplaceTempView("filters")

# create the main table
spark.range(10000000000).createOrReplaceTempView("test_base")
spark.sql("SELECT id, FLOOR(RAND() * 10) rand FROM test_base").createOrReplaceTempView("test")

# see the partitions in the filters and the main table
spark.sql("SELECT * FROM filters").rdd.getNumPartitions()
8
spark.sql("SELECT * FROM test").rdd.getNumPartitions()
8

# see the number of partitions in the filtered join output (I am assuming implicit casting here)
spark.sql("SELECT * FROM test WHERE rand IN (SELECT filter_value FROM filters)").rdd.getNumPartitions()
200

spark.sql("SET spark.sql.shuffle.partitions=10")
DataFrame[key: string, value: string]
spark.sql("SELECT * FROM test WHERE rand IN (SELECT filter_value FROM filters)").rdd.getNumPartitions()
10
========================================================================

Please do refer to the following page for adaptive SQL execution in Spark 3; it will be of massive help, particularly if you are handling skewed joins:
https://spark.apache.org/docs/latest/sql-performance-tuning.html

Thanks and Regards,
Gourav Sengupta
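Related to the SET spark.sql.shuffle.partitions=10 step above: in Spark 3, adaptive query execution (described on the page Gourav links to) can also coalesce small post-shuffle partitions at runtime instead of you hand-tuning the number. A minimal sketch, assuming an active spark session; the aggregation is made up purely to force a shuffle, and the resulting partition count depends on the data size and Spark version:

========================================================================
# Let AQE pick the number of post-shuffle partitions rather than always
# using the static spark.sql.shuffle.partitions value.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# The GROUP BY forces a shuffle; without AQE the result would normally land
# in spark.sql.shuffle.partitions (200 by default) partitions.
df = spark.range(1000).selectExpr("id % 10 AS k").groupBy("k").count()

# With coalescing on, the tiny shuffle output is typically merged into far
# fewer partitions at runtime.
df.rdd.getNumPartitions()
========================================================================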
On Sun, Jan 2, 2022 at 11:24 AM Bitfox <bit...@bitfox.top> wrote:

Thanks Mich. That looks good.

On Sun, Jan 2, 2022 at 7:10 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

LOL.

You asking these questions takes me back to summer 2016, when I started writing notes on Spark. They cover earlier versions, but the notions of RDDs, local, standalone, YARN etc. are still valid. In those days there were no k8s and the public cloud was not widely adopted. I browsed through the notes and it was refreshing for me. Anyway, you may find some points in there addressing the questions that you tend to ask.

HTH

On Sun, 2 Jan 2022 at 00:20, Bitfox <bit...@bitfox.top> wrote:

One more question: for this big filter, given that my server has 4 cores, will Spark (standalone mode) split the RDD into 4 partitions automatically?

Thanks

On Sun, Jan 2, 2022 at 6:30 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Create a list of the values that you don't want and filter on those:

>>> DF = spark.range(10)
>>> DF
DataFrame[id: bigint]
>>> array = [1, 2, 3, 8]  # don't want these
>>> DF.filter(DF.id.isin(array) == False).show()
+---+
| id|
+---+
|  0|
|  4|
|  5|
|  6|
|  7|
|  9|
+---+

or use the ~ (NOT) operator:

>>> DF.filter(~DF.id.isin(array)).show()
+---+
| id|
+---+
|  0|
|  4|
|  5|
|  6|
|  7|
|  9|
+---+

HTH

On Sat, 1 Jan 2022 at 20:59, Bitfox <bit...@bitfox.top> wrote:

Using the DataFrame API I need to implement a batch filter:

DF.select(..).where(col(..) != 'a' and col(..) != 'b' and ...)

There are a lot of keywords that need to be filtered out on the same column in the where statement. How can I make this smarter? A UDF or something else?

Thanks & Happy New Year!
Bitfox
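When the list of unwanted keywords grows too long to inline comfortably in isin(), another option, essentially the DataFrame-API counterpart of the filters table Gourav builds earlier in the thread, is to put the keywords into their own DataFrame and remove the matches with a left_anti join; this also works when the keyword list lives in a file or a table. A minimal sketch, assuming an active spark session and a hypothetical column named word:

========================================================================
from pyspark.sql.functions import broadcast

# Hypothetical input: one keyword per row in a column called "word".
df = spark.createDataFrame([("spark",), ("the",), ("filter",), ("and",)], ["word"])

# The unwanted keywords held in their own (small) DataFrame.
bad = spark.createDataFrame([("the",), ("and",), ("of",)], ["word"])

# left_anti keeps only the rows of df that have NO match in bad;
# broadcast() hints that the small side can be shipped to every executor.
clean = df.join(broadcast(bad), on="word", how="left_anti")
clean.show()
========================================================================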