Re: How to make batch filter

2022-01-02 Thread Bitfox
I always use the DataFrame API, though I am pretty familiar with general SQL. I used the method you provided to create a big filter, as described here: https://bitfoxtop.wordpress.com/2022/01/02/filter-out-stopwords-in-spark/ Thanks. On Sun, Jan 2, 2022 at 9:06 PM Mich Talebzadeh wrote: > Well the
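For readers without access to that post, a minimal sketch of such a stopword "big filter" in PySpark, in the style of the replies below (the column name word, the sample rows, and the stopword list are illustrative assumptions, not the blog's actual code):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("stopword-filter").getOrCreate()

# Hypothetical one-word-per-row data for illustration
df = spark.createDataFrame([("the",), ("spark",), ("and",), ("filter",)], ["word"])

stopwords = ["the", "and", "of", "a"]  # assumed stopword list

# Keep only the rows whose word is NOT in the stopword list
df.filter(~col("word").isin(stopwords)).show()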

Re: How to make batch filter

2022-01-02 Thread Mich Talebzadeh
Well, the short answer is that there is no such thing as one being categorically more performant; your mileage will vary. SQL is a domain-specific language used in programming and designed for managing data held in a relational database management system, or for stream processing in a relational data stream
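One reason the performance question has no categorical answer: both the SQL API and the DataFrame API compile through the same Catalyst optimizer, so an equivalent query typically yields the same physical plan either way. A quick way to see this for the simple filter in this thread (a sketch, assuming a local session):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)
df.createOrReplaceTempView("t")

# DataFrame API version of the query
df.filter(df.id > 5).explain()

# SQL version of the same query
spark.sql("SELECT id FROM t WHERE id > 5").explain()

For a query this simple, explain() normally prints identical physical plans for both versions.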

Re: How to make batch filter

2022-01-02 Thread Bitfox
May I ask, for the DataFrame API and the SQL API, which performs better? Thanks. On Sun, Jan 2, 2022 at 8:06 PM Gourav Sengupta wrote: > Hi Mich, > > your notes are really great, it really brought back the old days again :) > thanks. > > Just to note a few points that I found useful related to

Re: How to make batch filter

2022-01-02 Thread Gourav Sengupta
Hi Mich, your notes are really great; they really brought back the old days again :) Thanks. Just to note a few points that I found useful related to this question: 1. cores and threads - page 5; 2. executor cores and number settings - page 6. I think that the following example may be of use,
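The referenced notes are not reproduced in this thread, but those executor settings are typically supplied as Spark configuration. A sketch with purely illustrative values (not recommendations from the notes; the right numbers depend on your cluster):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("executor-settings-example")
         .config("spark.executor.instances", "4")  # number of executors
         .config("spark.executor.cores", "2")      # concurrent tasks per executor
         .config("spark.executor.memory", "4g")    # memory per executor
         .getOrCreate())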

Re: How to make batch filter

2022-01-02 Thread Bitfox
Thanks Mich. That looks good. On Sun, Jan 2, 2022 at 7:10 PM Mich Talebzadeh wrote: > LOL. > > You asking these questions takes me back to summer 2016 when I started > writing notes on Spark. Obviously earlier versions, but the notions of RDD, > Local, Standalone, YARN etc. are still valid. Those

Re: How to make batch filter

2022-01-02 Thread Khalid Mammadov
I think you will get 1 partition, as you have only one executor/worker (i.e. your local machine, a single node). But your tasks (the smallest unit of work in the Spark framework) will be processed in parallel on your 4 cores, as Spark runs one task per core. You can also force it to repartition if you want
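A quick way to inspect and, if needed, force the partition count (the local[4] master string is an assumption matching the 4-core machine in this thread):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()

df = spark.range(10)
print(df.rdd.getNumPartitions())   # inspect how many partitions Spark chose

# Force a specific partition count if needed
df4 = df.repartition(4)
print(df4.rdd.getNumPartitions())  # 4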

Re: How to make batch filter

2022-01-01 Thread Bitfox
One more question: for this big filter, given that my server has 4 cores, will Spark (standalone mode) split the RDD into 4 partitions automatically? Thanks. On Sun, Jan 2, 2022 at 6:30 AM Mich Talebzadeh wrote: > Create a list of values that you don't want and filter on those > > >>> DF =

Re: How to make batch filter

2022-01-01 Thread Bitfox
That’s great, thanks. On Sun, Jan 2, 2022 at 6:30 AM Mich Talebzadeh wrote:
> Create a list of values that you don't want and filter on those
>
> >>> DF = spark.range(10)
> >>> DF
> DataFrame[id: bigint]
> >>>
> >>> array = [1, 2, 3, 8]  # don't want these
> >>> DF.filter(DF.id.isin(array) ==

Re: How to make batch filter

2022-01-01 Thread Mich Talebzadeh
Create a list of values that you don't want and filter on those:

>>> DF = spark.range(10)
>>> DF
DataFrame[id: bigint]
>>>
>>> array = [1, 2, 3, 8]  # don't want these
>>> DF.filter(DF.id.isin(array) == False).show()
+---+
| id|
+---+
|  0|
|  4|
|  5|
|  6|
|  7|
|  9|
+---+

or use binary NOT
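The message is truncated here; the "binary NOT" presumably refers to PySpark's ~ operator on a Column, which is the more idiomatic way to negate isin than comparing with == False. A sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
DF = spark.range(10)
array = [1, 2, 3, 8]

# ~ negates a boolean Column; equivalent to comparing with == False
DF.filter(~DF.id.isin(array)).show()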

How to make batch filter

2022-01-01 Thread Bitfox
Using the DataFrame API I need to implement a batch filter: DF.select(..).where(col(..) != ‘a’ and col(..) != ‘b’ and …) There are a lot of keywords that should be filtered out for the same column in the where clause. How can I make it smarter? A UDF, or something else? Thanks & Happy New Year! Bitfox
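A side note on the snippet above: chaining Column conditions with Python's `and` does not work in PySpark; explicit chains need & with parentheses, and for many keywords on one column a single isin (as the replies above show) is simpler. A sketch with hypothetical data (the column name kw and the keyword list are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("x",)], ["kw"])

# Explicit chained conditions need & and parentheses, not Python's `and`:
df.where((col("kw") != "a") & (col("kw") != "b")).show()

# For many keywords on the same column, a single isin is simpler:
df.where(~col("kw").isin(["a", "b", "c"])).show()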