Re: How to get word offset all instances of a string in a chunk of text?
Mike Kerner wrote:
> Since the topic of processes came up a few weeks ago I've been
> thinking about what it would take to build a process/threading
> framework. I wonder if a text processing subprocessor, written
> and compiled...

I haven't yet come across good use cases for the desktop, but will have a need for multiprocessing on Linux servers later this year.

> ... in 6 would be worth everyone's time.

That would be a non-starter for me. I use LSON data a lot and the format changed with v7, a lot of things with text have changed, and given the hundreds of bug fixes between then and now I prefer to work with the current version.

--
Richard Gaskin
Fourth World Systems
Software Design and Development for the Desktop, Mobile, and the Web
ambassa...@fourthworld.com
http://www.FourthWorld.com

___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
Re: How to get word offset all instances of a string in a chunk of text?
Since the topic of processes came up a few weeks ago I've been thinking about what it would take to build a process/threading framework. I wonder if a text processing subprocessor, written and compiled in 6, would be worth everyone's time. The main app would hand off the data and the command to the subprocessor and be handed the results back. I wonder how large the dataset would have to be to make the overhead worthwhile.

On Fri, Aug 31, 2018 at 10:43 AM Keith Clarke via use-livecode <use-livecode@lists.runrev.com> wrote:

> Thanks Alex, HH & Jim for all the help & ideas.
>
> Just to close out the thread with a solution for future reference, the
> code below now extracts from a text source a list of unique words, cleaned
> up against a noise-word list, with word frequency, word, and a
> comma-delimited string of the word number within the original source.
>
> # Build unique words array
> repeat for each trueWord W in tSource
>    add 1 to tWordNum
>    if tANoise[W] then next repeat
>    put comma & tWordNum after tAWords[W]
> end repeat
>
> # Convert unique words array to list
> repeat for each key K in tAWords
>    put K && tAWords[K] & CR after tTemp
> end repeat
>
> repeat for each line tLine in tTemp
>    put the number of items in tLine & comma & tLine & cr after tWords
> end repeat
>
> sort lines of tWords descending numeric by item 1 of each
> put tWords into field "Words"
>
> Thanks & regards,
> Keith

--
On the first day, God created the heavens and the Earth
On the second day, God created the oceans.
On the third day, God put the animals on hold for a few hours,
and did a little diving.
And God said, "This is good."
Re: How to get word offset all instances of a string in a chunk of text?
Thanks Alex, HH & Jim for all the help & ideas.

Just to close out the thread with a solution for future reference, the code below now extracts from a text source a list of unique words, cleaned up against a noise-word list, with word frequency, word, and a comma-delimited string of the word number within the original source.

   # Build unique words array
   repeat for each trueWord W in tSource
      add 1 to tWordNum
      if tANoise[W] then next repeat
      put comma & tWordNum after tAWords[W]
   end repeat

   # Convert unique words array to list
   repeat for each key K in tAWords
      put K && tAWords[K] & CR after tTemp
   end repeat

   repeat for each line tLine in tTemp
      put the number of items in tLine & comma & tLine & cr after tWords
   end repeat

   sort lines of tWords descending numeric by item 1 of each
   put tWords into field "Words"

Thanks & regards,
Keith
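One detail worth checking in the listing above: each line of tTemp has the form "word ,n1,n2,...", so the number of items in tLine is one more than the number of occurrences (the sort order is unaffected). What follows is a minimal sketch, not Keith's code, that wraps the same single-pass idea in a function and reports the actual count. The handler name wordIndex, its parameters, and the tab-separated output layout are assumptions made for illustration.

```livecode
function wordIndex pSource, pNoiseWords
   local tNoise, tOffsets, tWordNum, tList, tResult
   # Flag each noise word so it can be looked up by key
   repeat for each trueWord W in pNoiseWords
      put true into tNoise[W]
   end repeat
   # Single pass: record the position of every non-noise word
   put 0 into tWordNum
   repeat for each trueWord W in pSource
      add 1 to tWordNum
      if tNoise[W] is true then next repeat
      put tWordNum & comma after tOffsets[W]
   end repeat
   # One line per word: frequency, tab, word, tab, positions
   repeat for each key K in tOffsets
      put char 1 to -2 of tOffsets[K] into tList # drop the trailing comma
      put the number of items in tList & tab & K & tab & tList & cr after tResult
   end repeat
   delete the last char of tResult
   sort lines of tResult descending numeric by word 1 of each
   return tResult
end wordIndex
```

Called as wordIndex("the quick brown fox jumped over the lazy dog", "over"), the line for "the" should carry frequency 2 and positions 1,7. Array keys are case-insensitive by default, so "The" and "the" share one entry unless the caseSensitive property is set to true.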
Re: How to get word offset all instances of a string in a chunk of text?
Jim:
> This just doesn’t work in all cases

That's the key though, don't repeat when it's not necessary! A day with no repeats is an efficient day. ;)

Best wishes,
Curry Kenworthy

Custom Software Development
LiveCode Training and Consulting
http://livecodeconsulting.com/
Re: How to get word offset all instances of a string in a chunk of text?
> I wrote:
>
> Then there is also this repeat-less approach using arrays and filter:
>
> function findWordOffsets pText, pSearchTerm
>    put replaceText(pText,"\W+"," ") into pText
>    split pText by space
>    combine pText with cr and tab
>    filter pText with "*" & tab & pSearchTerm
>    sort numeric pText
>    return pText
> end findWordOffsets

This just doesn’t work in all cases because splitting by space does not assure one is splitting by true words. :(
Sorry about that.

Jim Lambert
Re: How to get word offset all instances of a string in a chunk of text?
> On 30/08/2018 10:24, Keith Clarke via use-livecode wrote:
>> Folks,
>> Is there a single-pass mechanism or more efficient way of returning the
>> wordOffset of each instance of ‘the’ in ‘the quick brown fox jumped over the
>> lazy dog’ than to use two passes through the text?

Then there is also this repeat-less approach using arrays and filter:

   function findWordOffsets pText, pSearchTerm
      # collapse every run of non-word characters into a single space
      put replaceText(pText,"\W+"," ") into pText
      # array keyed 1..n, one word per element
      split pText by space
      # back to text: one "number <tab> word" line per element
      combine pText with cr and tab
      # keep only the lines whose word is the search term
      filter pText with "*" & tab & pSearchTerm
      sort numeric pText
      return pText
   end findWordOffsets

   # the original test string also had extra spaces and line breaks
   put "Then the quick brown fox jumped over" && quote & "The" & quote && "very, very lazy red dog on the sofa." into temp
   put findWordOffsets(temp, "the")

returns:

   2 the
   8 The
   15 the

Jim Lambert
Re: How to get word offset all instances of a string in a chunk of text?
hh:
> Sadly LC 9 is at about 10 times slower
> than LC 6 with such fast scripts.

Yes, I've been doing some benchmarks and LC 9 usually takes anywhere from 2x to 8x as long to perform a job. With or without text being involved.

It is a serious problem that should not be neglected across multiple major versions of LC. I'll share a test stack and video with some examples when I have a little time. (Including one test where LC 9 held its own.)

Meanwhile, optimize scripts! Then hopefully a serious boost once the engine itself is optimized.

Best wishes,
Curry Kenworthy

Custom Software Development
LiveCode Training and Consulting
http://livecodeconsulting.com/
Re: How to get word offset all instances of a string in a chunk of text?
> Alex T. wrote:
>
> put 0 into tOffset
> repeat for each trueWord W in tSource
>    add 1 to tOffset
>    if W = myWord then
>       put tOffset & comma after tOffsetList
>    end if
> end repeat

This is probably the fastest method for collecting the offsets of a single (true) word, whether trueWord or word chunks are used.

For a large tSource (say 4 MByte) it may be better to use CR instead of comma as the delimiter for the list. Otherwise, when putting tOffsetList into a field, LC may truncate the result or even hang (LC 9) because the maximum pixel width of a line is exceeded.
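The CR-delimited variant suggested above could look like the following. This is only a sketch: the function name trueWordOffsets and its parameters are made up for illustration, and the loop body is Alex's with the delimiter swapped.

```livecode
function trueWordOffsets pSource, pWord
   local tOffset, tList
   put 0 into tOffset
   repeat for each trueWord W in pSource
      add 1 to tOffset
      # "=" ignores case unless the caseSensitive property is true
      if W = pWord then put tOffset & cr after tList
   end repeat
   delete the last char of tList # trailing CR, if any
   return tList
end trueWordOffsets
```

For example, put trueWordOffsets("the quick brown fox jumped over the lazy dog", "the") yields 1 and 7 on separate lines, so a long result fills many short field lines rather than one very wide one.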
Re: How to get word offset all instances of a string in a chunk of text?
For a more general context see
http://www.runrev.com/pipermail/use-livecode//2004-February/032280.html

Sadly LC 9 is about 10 times slower than LC 6 with such fast scripts. For example, LC 6.7.11 needs about 500 ms to evaluate a 1 MByte string; LC 9.0.0 needs about 5 seconds.
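To compare engine versions on your own machine, a small timing harness along these lines may help. It is a sketch under stated assumptions and not the script behind the figures above: the function name timeWordScan, the repeated test sentence, and the report format are all invented for illustration.

```livecode
function timeWordScan pRepeats
   local tSource, tStart, tCount
   # Build test data by repeating one sentence (45 chars including the CR)
   repeat pRepeats times
      put "the quick brown fox jumped over the lazy dog" & cr after tSource
   end repeat
   # Time only the scan, not the data set-up
   put the milliseconds into tStart
   put 0 into tCount
   repeat for each trueWord W in tSource
      if W = "the" then add 1 to tCount
   end repeat
   return (the milliseconds - tStart) && "ms for" && the length of tSource && "chars," && tCount && "matches"
end timeWordScan
```

Calling put timeWordScan(23000) scans roughly 1 MByte of text. Running the same stack in each LC version gives a like-for-like number, though results will vary with hardware and with the script being measured.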
Re: How to get word offset all instances of a string in a chunk of text?
OK, this time I'm just typing into email - haven't tested these suggestions :-)

On 30/08/2018 10:24, Keith Clarke via use-livecode wrote:
> Folks,
> Is there a single-pass mechanism or more efficient way of returning the
> wordOffset of each instance of ‘the’ in ‘the quick brown fox jumped over
> the lazy dog’ than to use two passes through the text?

Yes. For a single word myWord:

   put 0 into tOffset
   repeat forever
      put trueWordOffset(myWord, tSource, tOffset) into tmp
      if tmp = 0 then exit repeat # no further match
      add tmp to tOffset # the result is relative to the words skipped
      put tOffset & comma after tOffsetList
   end repeat

BUT there's a chance that this performs poorly, because of repeated skipping, so I would also benchmark the simpler

   put 0 into tOffset
   repeat for each trueWord W in tSource
      add 1 to tOffset
      if W = myWord then
         put tOffset & comma after tOffsetList
      end if
   end repeat

> Pass-1. Count the instances of ‘the’ into an array and then
> Pass-2. Repeat for the count of instances using wordOffset, with a
> wordsToSkip variable derived from the previous loop’s offset
>
> I’m wondering if there’s something I’ve not yet learned about (nested?)
> arrays that might extend the unique word counter code that Alex, Paul &
> others helped me to fix a few days ago, to add a sub-array of wordOffset
> alongside word count?

I'm not entirely sure what you want here, or what the 'N' below are. Do you want a count and an offsetList for each word? If so, no need for nested arrays. Then I'd change your second loop below to:

   repeat for each trueWord W in tSource
      add 1 to tOffset
      if tANoise[W] then next repeat
      add 1 to tAWordCount[W]
      put tOffset & comma after tAWordOffsets[W]
   end repeat

and of course the third loop to

   repeat for each key K in tAWordCount
      put K && tAWordCount[K] & CR after tmp
   end repeat
   sort lines of tmp descending numeric by word 2 of each
   put tmp into fld "Words"

If I've misunderstood what you want, please say so and I'll try again :-)

Alex.

> # Prepare noisewords array
> repeat for each trueWord W in tNoiseWords
>    put true into tANoise[W]
> end repeat
>
> # Build unique words array
> repeat for each trueWord W in tSource
>    if tANoise[W] then next repeat
>    add 1 to tAWords[W][N]
> end repeat
>
> # Convert unique words array to list
> repeat for each key K in tAWords
>    put K && tAWords[K][N] & CR after fld "Words"
> end repeat
> sort lines of field "Words" descending numeric by word 2 of each
> end repeat
>
> Any ideas or steer towards a lesson / worked example greatly appreciated.
> Best,
> Keith
How to get word offset all instances of a string in a chunk of text?
Folks,
Is there a single-pass mechanism or more efficient way of returning the wordOffset of each instance of ‘the’ in ‘the quick brown fox jumped over the lazy dog’ than to use two passes through the text?

Pass-1. Count the instances of ‘the’ into an array and then
Pass-2. Repeat for the count of instances using wordOffset, with a wordsToSkip variable derived from the previous loop’s offset

I’m wondering if there’s something I’ve not yet learned about (nested?) arrays that might extend the unique word counter code that Alex, Paul & others helped me to fix a few days ago, to add a sub-array of wordOffset alongside word count?

   # Prepare noisewords array
   repeat for each trueWord W in tNoiseWords
      put true into tANoise[W]
   end repeat

   # Build unique words array
   repeat for each trueWord W in tSource
      if tANoise[W] then next repeat
      add 1 to tAWords[W][N]
   end repeat

   # Convert unique words array to list
   repeat for each key K in tAWords
      put K && tAWords[K][N] & CR after fld "Words"
   end repeat
   sort lines of field "Words" descending numeric by word 2 of each
   end repeat

Any ideas or steer towards a lesson / worked example greatly appreciated.

Best,
Keith