I always use the DataFrame API, though I am pretty familiar with general SQL.
I use the method you provided to create a big filter, as described here:
https://bitfoxtop.wordpress.com/2022/01/02/filter-out-stopwords-in-spark/
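(For context, a rough sketch of that kind of stopword filter with the DataFrame API might look like the following; the column name "word" and the tiny stopword list are just placeholders here, not the exact code from the post.)

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("stopword-filter").getOrCreate()

    # Hypothetical input: one word per row in a column called "word".
    df = spark.createDataFrame(
        [("the",), ("spark",), ("and",), ("filter",)], ["word"]
    )

    # A small illustrative stopword list; in practice this would be much bigger.
    stopwords = ["the", "and", "a", "of"]

    # Keep only the rows whose word is NOT in the stopword list.
    filtered = df.filter(~col("word").isin(stopwords))
    filtered.show()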
Thanks
On Sun, Jan 2, 2022 at 9:06 PM Mich Talebzadeh
wrote:
Well, the short answer is that there is no such thing as one being more
performant than the other. Your mileage varies.
SQL is a domain-specific language used in programming and designed for
managing data held in a relational database management system, or for
stream processing in a relational data stream management system.
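(To illustrate why the two usually perform the same: both the DataFrame API and spark.sql() go through the same Catalyst optimizer, so an equivalent query ends up with essentially the same physical plan. A minimal sketch, with a made-up table and columns:)

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("df-vs-sql").getOrCreate()

    # Hypothetical data, registered as a temp view so it can be queried both ways.
    df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "grp"])
    df.createOrReplaceTempView("t")

    # The same query expressed through the DataFrame API and through SQL.
    df_api = df.filter(col("grp") == "a").groupBy("grp").count()
    sql_api = spark.sql(
        "SELECT grp, COUNT(*) AS count FROM t WHERE grp = 'a' GROUP BY grp"
    )

    # Both explain() outputs show essentially the same optimized physical plan.
    df_api.explain()
    sql_api.explain()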
May I ask, for the DataFrame API and the SQL API, which is better in terms of performance?
Thanks
On Sun, Jan 2, 2022 at 8:06 PM Gourav Sengupta
wrote:
Hi Mich,
your notes are really great, it really brought back the old days again :)
thanks.
Just to note a few points that I found useful related to this question:
1. cores and threads - page 5
2. executor cores and number settings - page 6.
I think that the following example may be of use,
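(As a rough sketch of the executor core and count settings mentioned above, not the example from the notes themselves: these are commonly set as Spark configs, here shown via the SparkSession builder with purely illustrative numbers; the right values depend on the cluster and workload, and spark.executor.instances is honored on resource managers such as YARN.)

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("executor-settings-example")
        .config("spark.executor.cores", "4")      # cores (task slots) per executor
        .config("spark.executor.instances", "3")  # number of executors (e.g. on YARN)
        .config("spark.executor.memory", "4g")    # memory per executor
        .getOrCreate()
    )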
Thanks Mich. That looks good.
On Sun, Jan 2, 2022 at 7:10 PM Mich Talebzadeh
wrote:
> LOL.
>
> You asking these questions takes me back to summer 2016 when I started
> writing notes on Spark. Obviously earlier versions, but the notions of RDD,
> Local, standalone, YARN etc. are still valid. Those
I think you will get 1 partition, as you have only one Executor/Worker
(i.e. your local machine, a single node). But your tasks (the smallest unit of
work in the Spark framework) will be processed in parallel on your 4 cores,
as Spark runs one task per core.
You can also force a repartition if you want.
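(A quick way to check this, sketched in PySpark with a local[4] master; the data here is made up.)

    from pyspark.sql import SparkSession

    # local[4]: a single local "executor" (the driver JVM) using 4 cores.
    spark = (
        SparkSession.builder
        .master("local[4]")
        .appName("partitions-example")
        .getOrCreate()
    )

    df = spark.range(0, 1_000_000)

    # How many partitions Spark chose by default.
    print(df.rdd.getNumPartitions())

    # Force a different number of partitions if needed; each partition becomes
    # one task, and tasks run in parallel, one per core.
    repartitioned = df.repartition(8)
    print(repartitioned.rdd.getNumPartitions())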