I always use the dataframe API, though I am pretty familiar with general SQL.
I used the method you provided to create a big filter, as described here:
https://bitfoxtop.wordpress.com/2022/01/02/filter-out-stopwords-in-spark/
Thanks
On Sun, Jan 2, 2022 at 9:06 PM Mich Talebzadeh wrote:
Well, the short answer is that there is no such thing as one being
inherently more performant; your mileage will vary.
SQL is a domain-specific language used in programming and designed for
managing data held in a relational database management system, or for
stream processing in a relational data stream management system.
May I ask, for the dataframe API and the SQL API, which has better performance?
Thanks
On Sun, Jan 2, 2022 at 8:06 PM Gourav Sengupta wrote:
Hi Mich,
Your notes are really great; they really brought back the old days again :)
Thanks.
Just to note a few points that I found useful related to this question:
1. cores and threads - page 5
2. executor cores and number settings - page 6.
I think that the following example may be of use.
Thanks Mich. That looks good.
On Sun, Jan 2, 2022 at 7:10 PM Mich Talebzadeh wrote:
> LOL.
>
> You asking these questions takes me back to summer 2016 when I started
> writing notes on spark. Obviously earlier versions but the notion of RDD,
> Local, standalone, YARN etc. are still valid. Those
I think you will get 1 partition, as you have only one executor/worker
(i.e. your local machine, a node). But your tasks (the smallest unit of
work in the Spark framework) will be processed in parallel on your 4
cores, as Spark runs one task per core.
You can also force a repartition if you want.
One more question: for this big filter, given my server has 4 cores, will
Spark (standalone mode) split the RDD into 4 partitions automatically?
Thanks
On Sun, Jan 2, 2022 at 6:30 AM Mich Talebzadeh wrote:
That’s great thanks.
On Sun, Jan 2, 2022 at 6:30 AM Mich Talebzadeh wrote:
Create a list of values that you don't want and filter on those:
>>> DF = spark.range(10)
>>> DF
DataFrame[id: bigint]
>>>
>>> array = [1, 2, 3, 8] # don't want these
>>> DF.filter(DF.id.isin(array) == False).show()
+---+
| id|
+---+
| 0|
| 4|
| 5|
| 6|
| 7|
| 9|
+---+
or use the "~" (NOT) operator
Using the dataframe API I need to implement a batch filter:
DF.select(..).where((col(..) != 'a') & (col(..) != 'b') & ...)
There are a lot of keywords that should be filtered on the same column in
the where clause.
How can I make it smarter? A UDF, or something else?
Thanks & Happy new Year!
Bitfox