Hi Mich, what exactly do you mean by "if you prefer to broadcast the reference data"?
Philippe
> On 2 Apr 2023 at 18:16, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Hi Philippe,
>
> These are my thoughts, besides the comments from Sean.
>
> Just to clarify: you receive a CSV file periodically, and you already have a
> file that contains valid patterns for phone numbers (the reference data).
>
> In pseudo code, you can probe your CSV DataFrame against the reference DataFrame:
>
> // load your reference DataFrame
> val reference_DF = spark.read.parquet("path")
> // mark this smaller DataFrame to be cached in memory
> reference_DF.cache()
> // create a temp view
> reference_DF.createOrReplaceTempView("reference")
> // do the same for the CSV; adjust the line below as needed
> val csvDF = spark.read.format("csv")
>   .option("inferSchema", "true")
>   .option("header", "false")
>   .load("path")
> csvDF.cache() // this may or may not work if the CSV is large, but it is worth trying
> csvDF.createOrReplaceTempView("csv")
> spark.sql("JOIN Query").show()
>
> If you prefer to broadcast the reference data, you must first collect it on
> the driver before you broadcast it. This requires that the reference data fits in
> memory on your driver (and executors).
>
> You can then play around with that join.
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
>
> view my LinkedIn profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss,
> damage or destruction of data or any other property which may arise from
> relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
> On Sun, 2 Apr 2023 at 09:17, Philippe de Rochambeau <phi...@free.fr> wrote:
>> Many thanks, Mich.
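[A minimal sketch of the broadcast join Mich describes. In current Spark you do not need to collect the reference data on the driver yourself; the `broadcast()` hint ships the small side to every executor. Column names here (`number`) are assumptions, not from the original thread.]

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("phone-check").getOrCreate()

// small reference table of valid phone numbers (assumed column: "number")
val referenceDF = spark.read.parquet("path/to/reference")

// the periodically arriving CSV (assumed to have a single column, renamed "number")
val csvDF = spark.read
  .option("header", "false")
  .csv("path/to/csv")
  .toDF("number")

// broadcast() hints Spark to replicate the small reference side to all
// executors, so the large CSV side is joined without a shuffle
val matched = csvDF.join(broadcast(referenceDF), Seq("number"), "inner")
matched.show()
```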
>> Is "foreach" the best construct to look up items in a dataset such as
>> the "telephonedirectory" dataset below?
>>
>> // the telephone sequence, read from a CSV file
>> val telrdd = spark.sparkContext.parallelize(Seq("tel1", "tel2", "tel3", …))
>> val ds = spark.read.parquet("/path/to/telephonedirectory")
>>
>> telrdd.foreach(tel => {
>>   ds.filter(col("number").rlike("\\+" + tel)).show()
>> })
>>
>>> On 1 Apr 2023 at 22:36, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>> This may help:
>>>
>>> Spark rlike() Working with Regex Matching Example
>>> <https://sparkbyexamples.com/spark/spark-rlike-regex-matching-examples/>
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>>
>>> On Sat, 1 Apr 2023 at 19:32, Philippe de Rochambeau <phi...@free.fr> wrote:
>>>> Hello,
>>>> I'm looking for an efficient way in Spark to search for a series of
>>>> telephone numbers, contained in a CSV file, in a data set column.
>>>>
>>>> In pseudo code:
>>>>
>>>> for tel in [tel1, tel2, …, tel40,000]
>>>>   search for tel in dataset using .like("%tel%")
>>>> end for
>>>>
>>>> I'm using the like function because the telephone numbers in the data set
>>>> may contain prefixes, such as "+"; e.g., "+3312224444".
>>>>
>>>> Any suggestions would be welcome.
>>>>
>>>> Many thanks.
>>>> Philippe
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
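[Rather than looping over ~40,000 numbers with `like()` (one scan of the dataset per number), the lookup above can be expressed as a single distributed join: normalise the prefix once, then semi-join. A sketch under assumed column names (`tel` for the CSV, `number` for the directory), not the thread's actual schema:]

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace}

val spark = SparkSession.builder().appName("tel-lookup").getOrCreate()

// the ~40,000 numbers to search for (assumed single column, renamed "tel")
val telsDF = spark.read
  .option("header", "false")
  .csv("/path/to/tels.csv")
  .toDF("tel")

// the directory being searched (assumed column "number", values like "+3312224444")
val dirDF = spark.read.parquet("/path/to/telephonedirectory")

// strip a leading "+" once, then equi-join: one distributed join
// instead of 40,000 separate like("%tel%") scans
val hits = dirDF
  .withColumn("clean", regexp_replace(col("number"), "^\\+", ""))
  .join(telsDF, col("clean") === col("tel"), "left_semi")

hits.show()
```

A left-semi join returns only the directory rows that matched, without duplicating columns from the lookup side; if the small side fits in executor memory, wrapping `telsDF` in `broadcast()` avoids the shuffle entirely.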