Re: Looping through a series of telephone numbers

2023-04-03 Thread Gera Shegalov
+1 to using a UDF. E.g., TransmogrifAI uses libphonenumber https://github.com/google/libphonenumber that normalizes

Re: Looping through a series of telephone numbers

2023-04-02 Thread Mich Talebzadeh
Hi Philippe, Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute

Re: Looping through a series of telephone numbers

2023-04-02 Thread Philippe de Rochambeau
Hi Mich, what exactly do you mean by "if you prefer to broadcast the reference data"? Philippe
> On 2 Apr 2023, at 18:16, Mich Talebzadeh wrote:
>
> Hi Philippe,
>
> These are my thoughts besides comments from Sean
>
> Just to clarify, you receive a CSV file periodically and you already

Re: Looping through a series of telephone numbers

2023-04-02 Thread Philippe de Rochambeau
Wow, you guys, Anastasios, Bjørn and Mich, are stars! Thank you very much for your suggestions. I’m going to print them and study them closely.
> On 2 Apr 2023, at 20:05, Anastasios Zouzias wrote:
>
> Hi Philippe,
>
> I would like to draw your attention to this great library that saved my

Re: Looping through a series of telephone numbers

2023-04-02 Thread Anastasios Zouzias
Hi Philippe, I would like to draw your attention to this great library that saved my day in the past when parsing phone numbers in Spark: https://github.com/google/libphonenumber
If you combine it with Bjørn's suggestions you will have a good start on your linkage task. Best regards,

Re: Looping through a series of telephone numbers

2023-04-02 Thread Bjørn Jørgensen
dataset.csv
id,tel_in_dataset
1,+33
2,+331222
3,+331333
4,+331222
5,+331222
6,+331444
7,+331222
8,+331555

telephone_numbers.csv
tel
+331222
+331222
+331222
+331222

Start Spark with all of your CPUs and RAM:
import os
import multiprocessing

Re: Looping through a series of telephone numbers

2023-04-02 Thread Mich Talebzadeh
Hi Philippe, These are my thoughts, in addition to Sean's comments. Just to clarify: you receive a CSV file periodically, and you already have a file that contains valid patterns for phone numbers (the reference). In a pseudo-language, you can probe your CSV DF against the reference DF: // load your

Re: Looping through a series of telephone numbers

2023-04-02 Thread Sean Owen
That won't work; you can't use Spark within Spark like that. If it were exact matches, the best solution would be to load both datasets and join on telephone number. For this case, I think your best bet is a UDF that contains the telephone numbers as a list and decides whether a given number
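A minimal sketch of the kind of matching function such a UDF might wrap, written in plain Python so it runs without a Spark installation. It assumes that "matching" means the value contains one of the listed numbers as a substring; that assumption comes from the .like("%tel%") in the original question, not from this excerpt. The names make_contains_any and contains_any are hypothetical.

```python
from typing import Callable, Iterable, Optional


def make_contains_any(needles: Iterable[str]) -> Callable[[Optional[str]], bool]:
    """Build a predicate that reports whether a value contains any listed number."""
    # Deduplicate once, up front. Empty strings are dropped because
    # "" is a substring of every string and would match every row.
    unique = tuple({n for n in needles if n})

    def contains_any(number: Optional[str]) -> bool:
        # Treat missing (null) column values as "no match".
        if number is None:
            return False
        # Substring test against each listed number (assumed semantics).
        return any(n in number for n in unique)

    return contains_any


# Sample values taken from the dataset.csv / telephone_numbers.csv
# example elsewhere in this thread.
matcher = make_contains_any(["+331222", "+331222"])
print(matcher("+331222"))  # True
print(matcher("+331333"))  # False
print(matcher(None))       # False
```

In PySpark, a function like this could presumably be registered with pyspark.sql.functions.udf and a BooleanType return type, then used in a filter on the telephone column. The list is captured in the closure, so it is built once rather than once per row.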

Re: Looping through a series of telephone numbers

2023-04-02 Thread Philippe de Rochambeau
Many thanks, Mich. Is "foreach" the best construct to look up items in a dataset such as the "telephonedirectory" dataset below?
val telrdd = spark.sparkContext.parallelize(Seq("tel1", "tel2", "tel3" …)) // the telephone sequence was read from a CSV file
val ds =

Re: Looping through a series of telephone numbers

2023-04-01 Thread Mich Talebzadeh
This may help: "Spark rlike() Working with Regex Matching Examples". Mich Talebzadeh, Lead Solutions Architect/Engineering Lead, Palantir Technologies Limited

Looping through a series of telephone numbers

2023-04-01 Thread Philippe de Rochambeau
Hello, I’m looking for an efficient way in Spark to search for a series of telephone numbers, contained in a CSV file, in a dataset column. In pseudocode:
for tel in [tel1, tel2, …, tel40,000]
    search for tel in dataset using .like("%tel%")
end for
I’m using the like function
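The loop in the pseudocode above runs one substring search over the dataset per telephone number. As a rough illustration of the regex idea mentioned elsewhere in this thread, the sketch below collapses the list into a single alternation pattern so that each value is scanned once. It is plain Python, not Spark code; the helper names build_pattern and contains_listed_number are hypothetical, and "+331555" is used as a second listed number purely for illustration. Whether an alternation of 40,000 numbers performs acceptably with Spark's rlike() is not something the thread establishes.

```python
import re
from typing import Iterable, Optional


def build_pattern(numbers: Iterable[str]) -> Optional[re.Pattern]:
    """Combine all listed numbers into one compiled alternation pattern."""
    # Drop empty entries and duplicates; put longer numbers first so the
    # alternation prefers the most specific candidate.
    unique = sorted({n for n in numbers if n}, key=len, reverse=True)
    if not unique:
        return None
    # re.escape makes characters such as "+" match literally.
    return re.compile("|".join(re.escape(n) for n in unique))


def contains_listed_number(value: Optional[str], pattern: Optional[re.Pattern]) -> bool:
    """Equivalent of LIKE '%tel%' for any tel in the list, in one scan."""
    if value is None or pattern is None:
        return False
    return pattern.search(value) is not None


# Column values taken from the dataset.csv example in this thread.
rows = ["+33", "+331222", "+331333", "+331555"]
pattern = build_pattern(["+331222", "+331222", "+331555"])
matches = [v for v in rows if contains_listed_number(v, pattern)]
print(matches)  # ['+331222', '+331555']
```

Spark's rlike() uses Java regular expressions, so the escaping step would need to be checked against Java's rules before reusing a pattern built this way.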