Re: How to make batch filter

2022-01-02 Thread Bitfox
I always use the DataFrame API, though I am pretty familiar with general SQL. I used the method you provided to create one big filter, as described here: https://bitfoxtop.wordpress.com/2022/01/02/filter-out-stopwords-in-spark/ Thanks. On Sun, Jan 2, 2022 at 9:06 PM Mich Talebzadeh wrote: > Well the

Re: How to make batch filter

2022-01-02 Thread Mich Talebzadeh
Well, the short answer is that neither one is inherently more performant. Your mileage varies. SQL is a domain-specific language used in programming and designed for managing data held in a relational database management system, or for stream processing in a relational data stream

Re: How to make batch filter

2022-01-02 Thread Bitfox
May I ask: between the DataFrame API and the SQL API, which performs better? Thanks. On Sun, Jan 2, 2022 at 8:06 PM Gourav Sengupta wrote: > Hi Mich, > > your notes are really great, it really brought back the old days again :) > thanks. > > Just to note a few points that I found useful related to

Re: How to make batch filter

2022-01-02 Thread Gourav Sengupta
Hi Mich, your notes are really great; they really brought back the old days again :) Thanks. Just to note a few points that I found useful related to this question: 1. cores and threads - page 5 2. executor cores and number settings - page 6. I think that the following example may be of use,

Re: How to make batch filter

2022-01-02 Thread Bitfox
Thanks Mich. That looks good. On Sun, Jan 2, 2022 at 7:10 PM Mich Talebzadeh wrote: > LOL. > > You asking these questions takes me back to summer 2016 when I started > writing notes on spark. Obviously earlier versions but the notion of RDD, > Local, standalone, YARN etc. are still valid. Those

Re: How to make batch filter

2022-01-02 Thread Khalid Mammadov
I think you will get 1 partition, as you have only one executor/worker (i.e. your local machine, a single node). But your tasks (the smallest unit of work in the Spark framework) will be processed in parallel on your 4 cores, as Spark runs one task per core. You can also force it to repartition if you want