Hi Mich,
What exactly do you mean by "if you prefer to broadcast the reference data"?
Philippe

> On 2 Apr 2023 at 18:16, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> Hi Philippe,
> 
> These are my thoughts, in addition to Sean's comments.
> 
> Just to clarify: you receive a CSV file periodically, and you already have a 
> file containing the valid phone-number patterns (the reference data).
> 
> In pseudocode, you can probe your CSV DataFrame against the reference DataFrame:
> 
> // load the reference DataFrame (the smaller dataset)
> val reference_DF = spark.read.parquet("path")
> 
> // mark this smaller DataFrame to be cached in memory
> reference_DF.cache()
> // create a temp view so it can be queried with SQL
> reference_DF.createOrReplaceTempView("reference")
> // do the same for the CSV; change the path and options below as needed
> val csvDF = spark.read
>   .option("inferSchema", "true")
>   .option("header", "false")
>   .csv("path")
> csvDF.cache()  // this may or may not help if the CSV is large, but it is worth trying
> csvDF.createOrReplaceTempView("csv")
> spark.sql("JOIN Query").show
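> 
> For example, the join could look something like the following (phone_number 
> and pattern are just placeholder column names here; substitute whatever your 
> two files actually contain):
> 
> spark.sql("""
>   SELECT c.*
>   FROM csv c
>   JOIN reference r
>     ON c.phone_number RLIKE r.pattern
> """).show()
> 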
> If you prefer to broadcast the reference data, you must first collect it on 
> the driver before broadcasting it. This requires that the reference data fits 
> in memory on the driver (and on each executor).
> 
> You can then play around with that join.
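> 
> A minimal sketch of that collect-then-broadcast approach (pattern and phone 
> are hypothetical column names; adjust to your schemas):
> 
> import org.apache.spark.sql.functions.{col, udf}
> import spark.implicits._
> 
> // collect the (small) set of reference patterns onto the driver ...
> val patterns = reference_DF.select("pattern").as[String].collect()
> // ... and broadcast them once to every executor
> val patternsBc = spark.sparkContext.broadcast(patterns)
> 
> // true if any reference pattern matches the given number
> val matchesAny = udf((number: String) => patternsBc.value.exists(p => number.matches(p)))
> val matched = csvDF.filter(matchesAny(col("phone")))
> matched.show()
> 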
> HTH
> 
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> LinkedIn: https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/
> https://en.everybodywiki.com/Mich_Talebzadeh
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> 
> On Sun, 2 Apr 2023 at 09:17, Philippe de Rochambeau <phi...@free.fr> wrote:
>> Many thanks, Mich.
>> Is "foreach" the best construct to look up items in a dataset such as the 
>> "telephonedirectory" dataset below?
>> 
>> import org.apache.spark.sql.functions.col
>> 
>> val telrdd = spark.sparkContext.parallelize(Seq("tel1", "tel2", "tel3", ...)) // the telephone sequence
>> // was read from a CSV file
>> val ds = spark.read.parquet("/path/to/telephonedirectory")
>> 
>> telrdd.foreach(tel => {
>>   // "number" stands for the phone-number column; "+" must be escaped in a regex
>>   ds.select("*").filter(col("number").rlike("\\+" + tel))
>> })
>> 
>> 
>> 
>>> On 1 Apr 2023 at 22:36, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>> 
>>> This may help:
>>> 
>>> Spark rlike() Working with Regex Matching Examples 
>>> <https://sparkbyexamples.com/spark/spark-rlike-regex-matching-examples/>
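>>> 
>>> As a quick sketch of rlike in action (df and the "number" column are 
>>> placeholders), using the sample number from your question:
>>> 
>>> import org.apache.spark.sql.functions.col
>>> 
>>> // keep rows whose number is 3312224444, optionally preceded by "+"
>>> df.filter(col("number").rlike("^\\+?3312224444$")).show()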
>>> 
>>> 
>>> On Sat, 1 Apr 2023 at 19:32, Philippe de Rochambeau <phi...@free.fr> wrote:
>>>> Hello,
>>>> I’m looking for an efficient way in Spark to search a dataset column for a 
>>>> series of telephone numbers contained in a CSV file.
>>>> 
>>>> In pseudo code,
>>>> 
>>>> for tel in [tel1, tel2, ..., tel40000]
>>>>     search for tel in dataset using .like("%tel%")
>>>> end for
>>>> 
>>>> I’m using the like function because the telephone numbers in the dataset 
>>>> may contain prefixes, such as "+"; e.g., "+3312224444".
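>>>> 
>>>> In Scala, a literal translation of that loop would be something like this 
>>>> (tels being the list of numbers and "number" a placeholder column name), 
>>>> i.e. one Spark job per telephone number:
>>>> 
>>>> import org.apache.spark.sql.functions.col
>>>> 
>>>> tels.foreach { tel =>
>>>>   ds.filter(col("number").like(s"%$tel%")).show()
>>>> }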
>>>> 
>>>> Any suggestions would be welcome.
>>>> 
>>>> Many thanks.
>>>> 
>>>> Philippe
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>> 
