Re: [R] Discovering patterns in textual strings
You seem to be using semantics to make your choices, not merely rules-based patterns. But in any case, I cannot help. Perhaps someone else with more experience at this sort of thing or who is smarter can.

-- Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)

On Mon, May 7, 2018 at 2:02 PM, Jeff Reichman wrote:
> Here are some examples of the type of text strings I’m dealing with:
> [...]
Re: [R] Discovering patterns in textual strings
Bert

Here are some examples of the type of text strings I’m dealing with:

??.??.???
??.??.??
?Torrent? Pro - Torrent App
?Torrent?-Torrent Downloader
1 Pic 8 Words - Syllables
1 Pic 8 Words - Syllables
27043_Spanish songs for children
28.android.com.alpha.horoscope
28.android.com.bravo.horoscope
28.Card Game - Offline
28.card Game Multiplayer
37045_Spanish songs for children
7 Minute Workout for Weight Loss: Daily Cardio App
7 Minute Workout Plus
7 Minute Workout_SMA_IA_$2.25_com.popularapp.sevenmins_CD_Android_MEDIUMRECTANGLE_300x250_IAB7
7 Nights at Pizza House - 2
7 Nights at Pizza House 3D
com.zombodroid
com.zombodroid.battle
com.zombodroid.memegenerator
com.zone.talking.pet
com.zone.yinshidaquan
Disney Kingdom
Disney Kingdom_Android
Evite
Evite Invitations
Evite IOS_Evite_IOS_320x50
Excavator Simulator 3D:Sand
Excavator Snow Plow Loader Truck
Flippy Knife
Flippy Knife - 654567
fliptech.iowafmworld
fliptech.serbiafmworld
Floor is lava!
Floor is lava: Escape
Go_Launcher
Go_Launcher_Lite
myyearbook Android
myyearbook.com-MeetMe_Android_300x250_UK

hoping to obtain something like ….

??.??
Torrent
1 Pic 8 Words
7 Minute Workout
7 Nights at Pizza House
com.zombodroid
com.zone
Disney Kingdom
Flippy Knife
fliptech
Floor is lava
Go_Launcher
myyearbook

From: Bert Gunter
Sent: Saturday, May 5, 2018 2:14 AM
To: reichm...@sbcglobal.net
Cc: R-help
Subject: Re: [R] Discovering patterns in textual strings

I am still somewhat confused by your specifications, but others may not be. [...]
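A minimal sketch of one purely rule-based reading of the list above, not something proposed in the thread: treat a "key" as the longest prefix that ends at a separator and is shared by at least two unique strings. The separator set `[._ ]`, the threshold of two, and the helper name `prefixes()` are all assumptions, and the input is only a subset of the names listed.

```r
# Subset of the ad names from the message above (chosen for illustration)
x <- c("com.zombodroid", "com.zombodroid.battle", "com.zombodroid.memegenerator",
       "com.zone.talking.pet", "com.zone.yinshidaquan",
       "7 Minute Workout Plus",
       "7 Minute Workout for Weight Loss: Daily Cardio App",
       "7 Nights at Pizza House - 2", "7 Nights at Pizza House 3D")

# Hypothetical helper: every prefix of s that stops just before a separator,
# plus s itself
prefixes <- function(s, sep = "[._ ]") {
  pos <- gregexpr(sep, s)[[1]]
  pos <- pos[pos > 1]                 # drops -1 (no match) and a leading separator
  unique(substring(s, 1L, c(pos - 1L, nchar(s))))
}

# Count in how many unique strings each prefix occurs
counts <- table(unlist(lapply(unique(x), prefixes)))
shared <- names(counts)[as.vector(counts) >= 2]

# Assumption: keep only shared prefixes that no longer shared prefix extends
proper  <- unlist(lapply(shared, function(q) setdiff(prefixes(q), q)))
maximal <- setdiff(shared, proper)
print(maximal)
```

On this subset the keys that survive are com.zombodroid, com.zone, 7 Minute Workout and 7 Nights at Pizza House. Names such as "Floor is lava!" and "Floor is lava: Escape" differ at a punctuation mark and not at a separator, so they would need extra rules.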
Re: [R] Discovering patterns in textual strings
Jeff:

The previous solution I sent you was hugely inefficient and frankly kind of stupid. Here is a much better and simpler solution.

> z <- c("abc", "abc_def", "abc.def", "abc def", "abcd_ef", "abcd", "e", "f")

## Create vector of patterns of same length as z, many of which are repeated
> pats <- sub("^(.+)[. _].*", "\\1", z)

## Now can use tapply() to get indices if desired
## Note that the patterns label the groups
> tapply(seq_along(z), pats, I)
$abc
[1] 1 2 3 4

$abcd
[1] 5 6

$e
[1] 7

$f
[1] 8

No need to reply.

Cheers,
Bert

On Sat, May 5, 2018 at 12:14 AM, Bert Gunter wrote:
> "Does that help?"
> [...]
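One caveat with the sub() pattern above: (.+) is greedy, so for a string with several separators the captured key runs up to the last separator, not the first. A hedged sketch of the difference, using strings adapted from the earlier messages; the alternative pattern "^([^. _]+)" is an assumption about what is wanted, not part of the original solution.

```r
z <- c("Abc_1234_kjhksh_276", "Abc", "Abc.2", "BAdrf_gofer")

# Pattern from the message above: greedy, so it cuts at the LAST separator
greedy <- sub("^(.+)[. _].*", "\\1", z)

# Assumed alternative: the key is everything before the FIRST separator
first <- sub("^([^. _]+)[. _].*", "\\1", z)

print(greedy)
print(first)

# Group indices by key; split() is used here in place of tapply(..., I)
groups <- split(seq_along(z), first)
print(groups)
```

Which cut is right depends on the data; for the test vector in the message both patterns give the same groups because no string has more than one separator.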
Re: [R] Discovering patterns in textual strings
"Does that help?"

No. I am not your private consultant. You need to reply to the list, which I have cc'ed here, not just me.

I am still somewhat confused by your specifications, but others may not be. Part of my confusion stems from your failure to provide a reproducible example (see e.g. the posting guide linked below). For example, I cannot tell from your text whether the Abc and Bce strings contain one or more spaces at the end. I shall assume they may but need not.

Anyway, here is a reproducible example and solution that assumes that the substrings/patterns of interest to you occur at the beginning of the strings and may or may not be followed by one of "." "_" or " " (space) and then possibly further text which should be ignored. Assuming that you are familiar with regular expressions, maybe this will help to get you started even if I have misunderstood your specifications. If you aren't familiar with regexes, maybe the stringr package may provide a gentler interface than using R's raw regex functionality. Or maybe someone else can suggest a better approach (which is another reason why you should reply to the list, not just me).

z <- c("abc", "abc_def", "abc.def", "abc def", "abcd_ef", "abcd", "e", "f")

pats <- unique(sub("^(.+)[. _]+.*", "\\1", z))
## gives:
> pats
[1] "abc"  "abcd" "e"    "f"

This gives you the four separate patterns that you could then use to group your records, perhaps by:

> lapply(pats, function(x) grep(paste0("^", x, "([_. ]|$)"), z))
[[1]]
[1] 1 2 3 4

[[2]]
[1] 5 6

[[3]]
[1] 7

[[4]]
[1] 8

That is, indices 1-4 in z are the first group; 5 and 6 are the second; etc.

Cheers,
Bert

On Fri, May 4, 2018 at 9:00 PM, Jeff Reichman wrote:
> Bert
>
> Thank you for the link. Figured there might be something.
>
> Regarding your questions:
>
> This is from a large set of 53 Billion records. The column in question is AdNames (Real Time Bidding data).
>
> #1. Generally yes, but not always
>
> #2. Separators could be underscores (_) or dots (.) as in 1.2.3_ABC ..
>
> #3. Yes. So Abc 123 could be a matching string
>
> This would not be considered a match ...
> abc_something
> this.is_a long stringwithabcinthemiddle
>
> The sequence(s) are always at the beginning (or so it appears). Out of the 54 billion records I am able to pull (SparkR sql) 948,679 unique strings. It is from these unique strings that I (if possible) want to identify the "key" strings.
>
> 1. Abc_1232.niok7j9hd
> 2. Abc
> 3. Abc.2#348hfk2.njilo
> 4. Abc.2
> 5. Abc.7
> 6. BAdfr_kajdhf98#kjsdh
> 7. BAdrf_gofer
> 948679
>
> So I may have a thousand individual strings all of which have Abc as a common string, or Badrf. So I am looking to pull "Abc," "BAdrf", etc. So then I can go back and restructure the data to show that any record with Abc_1232.niok7j9hd is part of the Abc "Group," or Family ???
>
> Does that help
>
> Jeff
>
> -----Original Message-----
> From: Bert Gunter
> Sent: Friday, May 4, 2018 5:41 PM
> To: reichm...@sbcglobal.net
> Cc: R-help
> Subject: Re: [R] Discovering patterns in textual strings
>
> The answer is, of course, using regular expressions and/or libraries therefor. [...]
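The grep() call above pastes each key into a regular expression, so a key containing a metacharacter such as "." is not matched literally. A minimal sketch of a literal alternative, assuming the same three separators; matches_prefix() is a hypothetical helper, not from the thread, and the test strings are made up.

```r
# Hypothetical helper: TRUE where z starts with the literal key p and the key
# is followed by a separator or by the end of the string.
# startsWith() is in base R from version 3.3.0.
matches_prefix <- function(z, p, seps = c("_", ".", " ")) {
  nxt <- substr(z, nchar(p) + 1L, nchar(p) + 1L)  # "" when z ends right after p
  startsWith(z, p) & (nxt == "" | nxt %in% seps)
}

z <- c("com.zone.pet", "comXzone.pet", "com.zone", "com.zoned")
p <- "com.zone"

# Regex version from the message: "." in the key matches any character
regex_hits <- grep(paste0("^", p, "([_. ]|$)"), z)

# Literal version
literal_hits <- which(matches_prefix(z, p))

print(regex_hits)
print(literal_hits)
```

Here the regex version also picks up "comXzone.pet", while the literal version does not.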
Re: [R] Discovering patterns in textual strings
The answer is, of course, using regular expressions and/or libraries therefor. However, I do not think you have defined your problem sufficiently. Some questions I have:

1. Do possible patterns to be matched always appear at the beginning of your strings?

2. Always together between specified separators ("_" in your example); or one of several specified separators; or otherwise?

3. Do spaces or other nonprinting characters occur in your strings?

e.g. would

abc_something
this.is_a long stringwithabcinthemiddle

be considered matching?

There are undoubtedly other possibilities that I've missed.

You may also find it useful to check this "task view" out for possibilities:
https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

Cheers,
Bert

On Fri, May 4, 2018 at 3:25 PM, Jeff Reichman wrote:
> R Help Forum
>
> Is there a R library (or a way) that I can extract unique character strings, or repeating patterns in textual strings.
> [...]

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
[R] Discovering patterns in textual strings
R Help Forum

Is there an R library (or a way) that I can extract unique character strings, or repeating patterns in textual strings? Say for example I have the following records:

Abc_1234_kjhksh_276
Abc
Abc_1234_lakdofyo_324
Bce_876_skdhk_*&^%*&
Bce
Bce_454

And I would like to see the following results:

Abc
Abc_1234
Bce

Jeff Reichman
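A rough sketch of one rule that reproduces the requested output for these six records: list every "_"-delimited prefix of each record and keep those that occur in at least two records. The rule, the threshold and the helper name prefixes() are assumptions for illustration; nothing in the post confirms this is the intended definition.

```r
x <- c("Abc_1234_kjhksh_276", "Abc", "Abc_1234_lakdofyo_324",
       "Bce_876_skdhk_*&^%*&", "Bce", "Bce_454")

# Hypothetical helper: all cumulative sep-delimited prefixes of one string,
# e.g. "a_b_c" -> "a", "a_b", "a_b_c"
prefixes <- function(s, sep = "_") {
  tok <- strsplit(s, sep, fixed = TRUE)[[1]]
  vapply(seq_along(tok),
         function(i) paste(tok[seq_len(i)], collapse = sep),
         character(1))
}

# Count in how many unique records each prefix occurs
counts <- table(unlist(lapply(unique(x), prefixes)))

# Assumed rule: a "pattern" is a prefix shared by at least two records
shared <- names(counts)[as.vector(counts) >= 2]
print(shared)
```

For the six records above this keeps Abc, Abc_1234 and Bce, since every other prefix occurs in only one record.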