Hi Philippe,

These are my thoughts, in addition to Sean's comments.

Just to clarify: you receive a CSV file periodically, and you already have a
file that contains valid patterns for phone numbers (the reference).

In pseudocode, you can probe your CSV DataFrame against the reference DataFrame:

// load your reference dataframe
val reference_DF = spark.read.parquet("path")

// mark this smaller dataframe to be stored in memory
reference_DF.cache()

// Create a temporary view

reference_DF.createOrReplaceTempView("reference")

// Do the same for the CSV; adjust the path below

val csvDF = spark.read.option("inferSchema",
"true").option("header", "false").csv("path")

csvDF.cache()  // This may or may not work if the CSV is large, but it is
worth trying

csvDF.createOrReplaceTempView("csv")

spark.sql("JOIN Query").show()
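
As an illustration only, assuming the reference view has a column called
pattern and the CSV view has a column called tel (both names are
placeholders, substitute your own), the join could look something like this:

spark.sql("""
  SELECT c.tel
  FROM csv c
  JOIN reference r
    ON c.tel RLIKE r.pattern
""").show()

Note that a non-equi join like this cannot use a hash join; Spark will fall
back to a broadcast nested loop join, which is another reason to keep the
reference side small and cached.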

If you prefer to broadcast the reference data explicitly, you must first
collect it on the driver before broadcasting it. This requires that the
reference data fits in memory on the driver (and on each executor).

You can then play around with that join.
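
For example, here is a minimal sketch with the DataFrame API and an explicit
broadcast hint, again assuming the hypothetical column names tel and pattern:

import org.apache.spark.sql.functions.{broadcast, expr}

// broadcast() tells Spark to ship the small reference table to every
// executor instead of shuffling the large CSV side
val matched = csvDF.join(broadcast(reference_DF), expr("tel RLIKE pattern"))
matched.show()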

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 2 Apr 2023 at 09:17, Philippe de Rochambeau <phi...@free.fr> wrote:

> Many thanks, Mich.
> Is "foreach" the best construct to look up items in a dataset such as
> the "telephonedirectory" data set below?
>
> val telrdd = spark.sparkContext.parallelize(Seq("tel1", "tel2",
> "tel3", ...)) // the telephone sequence
>
> // was read from a CSV file
>
> val ds = spark.read.parquet("/path/to/telephonedirectory")
>
>   rdd.foreach(tel => {
>     longAcc.select("*").rlike("+" + tel)
>   })
>
>
>
>
> Le 1 avr. 2023 à 22:36, Mich Talebzadeh <mich.talebza...@gmail.com> a
> écrit :
>
> This may help
>
> Spark rlike() Working with Regex Matching Example
> <https://sparkbyexamples.com/spark/spark-rlike-regex-matching-examples/>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 1 Apr 2023 at 19:32, Philippe de Rochambeau <phi...@free.fr>
> wrote:
>
>> Hello,
>> I’m looking for an efficient way in Spark to search for a series of
>> telephone numbers, contained in a CSV file, in a data set column.
>>
>> In pseudo code,
>>
>> for tel in [tel1, tel2, ... tel40,000]
>>         search for tel in dataset using .like("%tel%")
>> end for
>>
>> I’m using the like function because the telephone numbers in the data set
>> may contain prefixes, such as "+"; e.g., "+3312224444".
>>
>> Any suggestions would be welcome.
>>
>> Many thanks.
>>
>> Philippe
>>
>>
>>
>>
>>
>>
>
