Hi Philippe,
Broadcast variables allow the programmer to keep a read-only variable
cached on each machine rather than shipping a copy of it with tasks. They
can be used, for example, to give every node a copy of a large input
dataset in an efficient manner. Spark also attempts to distribute
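The quoted passage can be illustrated with a plain-Python sketch of the broadcast idea — this is not the Spark API itself (in PySpark you would call `sc.broadcast(...)` and read `.value` inside tasks), and the reference numbers are assumed examples:

```python
# Shared, read-only reference data: built once and reused by every
# task, rather than shipped as a copy alongside each task's input.
reference = frozenset({"+331222", "+331333"})  # assumed reference numbers

def task(partition, ref):
    # Each task only reads the shared set; it never mutates it.
    return [tel for tel in partition if tel in ref]

partitions = [["+331222", "+331444"], ["+331333", "+331555"]]
results = [task(p, reference) for p in partitions]
print(results)  # [['+331222'], ['+331333']]
```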
Hi Mich,
What exactly do you mean by « if you prefer to broadcast the reference data »?
Philippe
> On 2 Apr 2023, at 18:16, Mich Talebzadeh wrote:
>
> Hi Philippe,
>
> These are my thoughts, in addition to Sean's comments
>
> Just to clarify, you receive a CSV file periodically and you already
Wow, you guys, Anastasios, Bjørn and Mich, are stars!
Thank you very much for your suggestions. I’m going to print them and study
them closely.
> On 2 Apr 2023, at 20:05, Anastasios Zouzias wrote:
>
> Hi Philippe,
>
> I would like to draw your attention to this great library that saved my
Hi Philippe,
I would like to draw your attention to this great library that saved my day
in the past when parsing phone numbers in Spark:
https://github.com/google/libphonenumber
If you combine it with Bjørn's suggestions you will have a good start on
your linkage task.
Best regards,
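As a taste of the kind of normalization libphonenumber (or its Python port, `phonenumbers`) automates, here is a deliberately crude, stdlib-only stand-in — the real library handles country metadata, validation, and formatting properly, and the `+33` default here is an assumed example:

```python
import re

def normalize(raw, default_prefix="+33"):
    # Crude stand-in for real phone-number parsing: strip separators
    # and coerce the number into +<digits> form.
    digits = re.sub(r"[^\d+]", "", raw)
    if digits.startswith("00"):
        return "+" + digits[2:]
    if not digits.startswith("+"):
        return default_prefix + digits.lstrip("0")
    return digits

print(normalize("00 33 1 222"))  # +331222
print(normalize("01 222"))       # +331222
```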
dataset.csv
id,tel_in_dataset
1,+33
2,+331222
3,+331333
4,+331222
5,+331222
6,+331444
7,+331222
8,+331555
telephone_numbers.csv
tel
+331222
+331222
+331222
+331222
Start Spark with all of your CPUs and RAM:
import os
import multiprocessing
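The snippet above breaks off after the imports; a hedged guess at its continuation is to size the local master string from the machine's core count (the SparkSession lines are commented out because they need a Spark installation, and the memory figure is an assumed example):

```python
import multiprocessing

cores = multiprocessing.cpu_count()
master = f"local[{cores}]"  # one Spark task slot per CPU core
print(master)

# With pyspark installed, the session would then be built roughly as:
# from pyspark.sql import SparkSession
# spark = (SparkSession.builder
#          .master(master)
#          .config("spark.driver.memory", "8g")  # assumed size
#          .getOrCreate())
```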
Hi Philippe,
These are my thoughts, in addition to Sean's comments
Just to clarify: you receive a CSV file periodically, and you already have a
file that contains valid patterns for phone numbers (the reference data)
In pseudo-code, you can probe your CSV DataFrame against the reference DataFrame:
// load your
That won't work, you can't use Spark within Spark like that.
If it were exact matches, the best solution would be to load both datasets
and join on telephone number.
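Using the sample files posted earlier in the thread, the exact-match lookup can be sketched in plain Python — in Spark you would load each CSV as a DataFrame and join on the telephone column instead; the files are inlined here so the sketch is self-contained:

```python
import csv
import io

# The two sample files from the thread, inlined as strings.
dataset_csv = """id,tel_in_dataset
1,+33
2,+331222
3,+331333
4,+331222
5,+331222
6,+331444
7,+331222
8,+331555
"""
telephone_csv = """tel
+331222
+331222
+331222
+331222
"""

# Distinct reference numbers, then keep the ids whose number matches.
wanted = {row["tel"] for row in csv.DictReader(io.StringIO(telephone_csv))}
matches = [row["id"] for row in csv.DictReader(io.StringIO(dataset_csv))
           if row["tel_in_dataset"] in wanted]
print(matches)  # ['2', '4', '5', '7']
```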
For this case, I think your best bet is a UDF that contains the telephone
numbers as a list and decides whether a given number
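The message breaks off above, but the shape of such a predicate can be sketched in plain Python — the numbers are hypothetical, and in PySpark you would wrap the function with `pyspark.sql.functions.udf` (ideally broadcasting the set first) before applying it to the telephone column:

```python
# Hypothetical reference numbers; a Spark UDF would close over this
# set (ideally via a broadcast variable) rather than rebuild it per row.
known_numbers = {"+331222", "+331333", "+331444"}

def is_known(tel):
    # True when the number appears in the reference list.
    return tel in known_numbers

print([is_known(t) for t in ["+331222", "+331555"]])  # [True, False]
```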
Many thanks, Mich.
Is « foreach » the best construct to look up items in a dataset such as the
« telephonedirectory » dataset below?
val telrdd = spark.sparkContext.parallelize(Seq("tel1", "tel2", "tel3" …)) // the telephone sequence
// was read from a CSV file
val ds =