Re: MatchText, MatchChunk and the needle in the haystack
Can you do it with a text editor and regular expressions? I'm genuinely diffident about asking, because you all have so much more experience that if it were this easy, you'd have suggested it. But anyway, is there something wrong with the following?

I made up a fragment of a file like this in the form

02-Mar-92sometext01-Sep-04somemore text

and a few more entries of the same sort. Then opened it in Kate (but presumably all programming editors have similar functionality?) Then did a match with regular expressions in the Find part of the menu. It helped construct the following expression:

[\d][\d]-[\D][\D][\D]-[\d][\d]

which really would not have been so very hard to figure out unaided - a classic case of the obligatory gui getting in the way of your typing. This picks up all dates and it obviously misses other hyphenated expressions.

Then in the replace section I put

Enter\0

It uses the \0 as backwards reference, so to include all the found string in the replacement. The only hard part, all of ten seconds, was that I didn't seem able to enter a line feed character directly, like by \n for instance, but I just copied and pasted one and bingo, it worked fine. I ended up with a bunch of lines like this:

02-Mar-92sometext
01-Sep-04somemore text

..and so on. Was that what was wanted? This was almost instant.

I guess if I'd a lot to do, I would think of an awk one liner, but have forgotten how to do backward references in awk. And it would be even more embarrassing to have both got the above all wrong and to also cite duff awk scripts!

Peter
___
use-revolution mailing list
use-revolution@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
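For anyone who would rather script Peter's editor steps than click through them, here is a minimal Python sketch of the same find-and-replace. It is not what Peter ran (he used Kate's dialog); the name `split_records` and the sample string are made up for illustration. In the replacement, `\g<0>` plays the role of Kate's `\0`.

```python
import re

# Same shape as the Kate expression:
# two digits, hyphen, three non-digits, hyphen, two digits.
DATE = re.compile(r"\d\d-\D\D\D-\d\d")

def split_records(text):
    """Put a newline in front of every date-shaped match (hypothetical helper)."""
    # \n is the line feed Peter had to paste in; \g<0> is the whole match.
    return DATE.sub(r"\n\g<0>", text).lstrip("\n")

sample = "02-Mar-92sometext01-Sep-04somemore text"
print(split_records(sample))
```

Two caveats if this is adapted to real data: `\D\D\D` accepts any three non-digit characters, not only letters, and the pattern expects a two-digit day, so a tighter month pattern and `\d{1,2}` for the day may be needed.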
Re: MatchText, MatchChunk and the needle in the haystack
On 3/21/07 1:32 AM, Peter Alcibiades [EMAIL PROTECTED] wrote:

> Can you do it with a text editor and regular expressions? I'm genuinely diffident about asking, because you all have so much more experience that if it were this easy, you'd have suggested it.

My basic approach for this kind of question is to assume that users have very little experience with regular expressions combined with knowing very little about the data set they are mining. Also, the question they actually ask on the list is just one part of the overall task.

Given these three things, I like to propose tools that let them see some of the pitfalls that making incorrect assumptions about the data can create. One pitfall is assuming all occurrences of the date string will be correctly formatted and intact. I guess I look at it as 'what will help them build a tool they can trust'.

Don't get me wrong, I like and use regEx in a few of my apps for effectively extracting clean data from a variety of web sites. I like its power and flexibility. As you say, if the user already knew some of the simpler regEx, the question probably would not have appeared on the list.

I cannot speak for others on the list, but it seems that those who venture into regEx only occasionally get frustrated and are better off using the chunking expressions of Rev. Even when presented with a good regEx answer, they are not sure what they are looking at.

By the way, nulls will make MatchText, etc. fail, so

replace null with empty in textBlock

needs to be part of the process for unknown data sources.

As far as using a text editor, that is usually my first step. I like BBEdit on an OSX platform, so I agree with your basic premise, start simple and build up. Nice to know you are paying attention to the big picture :-)

Good post.

Jim Ault
Las Vegas
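The pitfall Jim describes, trusting that every date in the file is well formed, can be checked before any splitting is done. A minimal Python sketch, assuming English three-letter month abbreviations and two-digit years; `suspicious_dates` is a hypothetical helper, not something from the thread:

```python
import re

MONTHS = {"Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"}

# Deliberately loose: anything shaped like d-xxx-dd or dd-xxx-dd.
LOOSE = re.compile(r"(\d{1,2})-([A-Za-z]{3})-(\d{2})")

def suspicious_dates(text):
    """Return date-shaped strings whose month or day looks wrong."""
    bad = []
    for m in LOOSE.finditer(text):
        day, month = int(m.group(1)), m.group(2)
        if month.title() not in MONTHS or not 1 <= day <= 31:
            bad.append(m.group(0))
    return bad

print(suspicious_dates("02-Mar-92ok41-Sep-04odd07-Mat-93typo"))
```

An empty result does not prove the file is clean, it only rules out the two problems tested for here.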
Re: MatchText, MatchChunk and the needle in the haystack
There is a wonderful book I just found for this sort of thing, and am working through: Minimal Perl, by Tim Maher.

Awk is great, terse, powerful, but a bit opaque. And more up to date people always seem to talk about using Perl for what awk always was used for. Well, if you ever felt you too should come up to date, got tired of pitying looks when you mentioned awk, took up some materials on Perl and then threw up your hands in despair, get Minimal Perl. Clear, practical, easy, and with a focus on one liners of exactly the sort that you'd use in the situation on this thread.

You could call it 'Perl for the rest of us'. Text manipulation without tears.

Peter
Re: MatchText, MatchChunk and the needle in the haystack
Jim, Dave, Devin

Thanks for your help in making me think harder about this. I literally woke up out of a dream this morning and knew right away what was wrong with the script. There was one error that would have persistently been a problem that I have fixed now. In the interests of anyone else who encounters a similar horrible string task, the solution is provided below.

One more thing. You all get credit for making me think harder about what else was in the files that might have been a random char throwing things off.

Now, I did go and change the script to make it simpler. I realized I only needed to find the hyphen at the start of the date and simply advance forward past the next hyphen in the date string. Since we were dealing with fixed length records forward from the first hyphen (three char month, hyphen, two char year) this was the simplest way. Genius? I thought so.

As luck would have it, I had hit upon the few records that were problem children right off the bat. It turned out that a few of the records had the word "in-line", with a hyphen, which threw off the whole thing. So there is now a separate script, run when the file is read in, that checks for nulls, odd-ball ascii codes, and our friend in-line.

I was lucky in this case that the records were so simple. The alternative would have been to keep the -Jan-...-Dec- chunks and walk through the file 12 times. No big deal I suppose and it could always be done that way if one had different chunks to search for.

Anyway, here is the finished script with comments. I hope it helps others who might have similar issues. I have over 5000 of these files to do which will now take about ten minutes versus the agony (and days) I'd have had to endure if there had been no community here to draw upon for help and if rev was not so darn handy.

By the way, the script that adds the return character also puts in a comma in the right place after the date so that I have another delimiter to work with, and the record in the end is comma delimited with a return character as the record marker. Much better than the ugly long single string I started out with.

Thanks All.

--

on mouseUp
  put fld 1 into textBlock
  put makeOffsets("-",textBlock,1) into varOffsets
  sort lines of varOffsets numeric descending
  -- this is the only way it works as otherwise the char count gets thrown
  -- off. essentially we are working up from the end of the string forward
  repeat for each line varRecord in varOffsets
    put char varRecord-2 to varRecord-1 of textBlock into eval
    if char 1 of eval is a number and char 2 of eval is a number then
      put comma after char varRecord+6 of textBlock
      put cr before char varRecord-2 of textBlock
    else
      if char 1 of eval is not a number and char 2 of eval is a number then
        put comma after char varRecord+6 of textBlock
        put cr before char varRecord-1 of textBlock
      end if
    end if
  end repeat
  put textBlock into fld 1
end mouseUp

function makeOffsets varChunk,textBlock,posStart
  if posStart = empty then
    put 1 into pos
  else
    put posStart into pos
  end if
  repeat until varOffset = 0
    put offset(varChunk, textBlock, pos) into varOffset
    if varOffset > 0 then
      put varOffset+pos & return after newText
      -- this was what was mucked-up in the original script
      -- have to add the prior pos to the new one since we
      -- are using the skip chars option and need to add
      -- the prior position to the new relative pos
      add varOffset+length(varChunk)+6 to pos
      -- i could get away with adding a fixed number in this
      -- case since the date was never going to be shorter than
      -- six chars + the found offset + chunk, (-) in this case
    else
      exit repeat
    end if
  end repeat
  return newText
end makeOffsets
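The reason for sorting the offsets in descending order can be seen in a few lines. What follows is a Python sketch of the same idea, not a translation of Bryan's handler: it assumes dates of the form d-Mon-yy or dd-Mon-yy, uses 0-based indexes rather than Rev's 1-based character positions, and `insert_delimiters` is a made-up name.

```python
import re

# A hyphen that starts "-Mon-yy"; the lookahead keeps stray hyphens
# such as the one in "in-line" out of the list.
HYPHEN = re.compile(r"-(?=[A-Za-z]{3}-\d{2})")

def insert_delimiters(text):
    """Add a newline before each date and a comma after it."""
    chars = list(text)
    offsets = [m.start() for m in HYPHEN.finditer(text)]
    # Walk from the last hit to the first: inserting near the end of the
    # string never moves the positions that are still waiting to be handled.
    for pos in sorted(offsets, reverse=True):
        day_len = 2 if pos >= 2 and chars[pos - 2].isdigit() else 1
        if pos < day_len or not chars[pos - 1].isdigit():
            continue  # no day digit in front of the hyphen: not a date
        chars.insert(pos + 7, ",")          # just after "-Mon-yy" (7 chars)
        chars.insert(pos - day_len, "\n")   # just before the first day digit
    return "".join(chars).lstrip("\n")

print(insert_delimiters("06-Mar-92some in-line text2-Apr-92rest"))
```

Like the handler above, this treats two digits in front of the hyphen as a two-digit day, so a one-digit day that directly follows text ending in a digit would be split in the wrong place.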
Re: MatchText, MatchChunk and the needle in the haystack
On 3/20/07 3:12 AM, Bryan McCormick [EMAIL PROTECTED] wrote:

> Jim, Dave, Devin
> Thanks for your help in making me think harder about this. I literally woke up out of a dream this morning and knew right away what was wrong with the script. There was one error that would have persistently been a problem that I have fixed now.

Glad it worked out so well. Data mining is a tricky business, especially if the originator allows delimiters to also be content (such as commas and hyphens).

The one change I would make in your routine is the use of a tab instead of a comma as a delim, since the comma is such a common character, but that depends on your data set. I assume that you are not encountering any commas in the data.

I love the mornings when I wake up and realize the answer to a programming puzzle. No matter what the weather, it is a sunny day for me :-)

Jim Ault
Las Vegas
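Jim's point about delimiters that can also be content is easy to demonstrate. A tiny Python illustration with two invented records, assuming the free text may contain commas but never tabs:

```python
# Hypothetical records: the same date and text, delimited two ways.
record_comma = "06-Mar-92,paid in full, less fees"
record_tab = "06-Mar-92\tpaid in full, less fees"

print(record_comma.split(","))   # the text is cut at its own comma
print(record_tab.split("\t"))    # date and text stay as two fields
```

Whether tab is actually safe depends on the data set, as Jim says; the same check applies to any candidate delimiter.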
Re: MatchText, MatchChunk and the needle in the haystack
On Mar 20, 2007, at 9:29 AM, Jim Ault wrote:

>> On Mar 20, 2007, at 4:12 AM, Bryan McCormick wrote:
>> Jim, Dave, Devin
>> Thanks for your help in making me think harder about this. I literally woke up out of a dream this morning and knew right away what was wrong with the script. There was one error that would have persistently been a problem that I have fixed now.

Glad I was able to help jostle some brain cells.

> I love the mornings when I wake up and realize the answer to a programming puzzle.

Amen!

> No matter what the weather, it is a sunny day for me :-)

Wait, when is it ever *not* sunny in Las Vegas? ;-)

Devin

Devin Asay
Humanities Technology and Research Support Center
Brigham Young University
Re: MatchText, MatchChunk and the needle in the haystack
On 3/20/07 9:42 AM, Devin Asay [EMAIL PROTECTED] wrote:

> Wait, when is it ever *not* sunny in Las Vegas? ;-)

Very seldom. About twice a year we will have 3 days in a row of cloudy weather.

Jim
Re: Re: MatchText, MatchChunk and the needle in the haystack
Jim,

Thanks for the script snippet. It didn't quite work as shown, but it did get me to think about the problem more carefully. I came up with this:

put "-Jan-,-Feb-,-Mar-,-Apr-,-May-,-Jun-,-Jul-,-Aug-,-Sep-,-Oct-,-Nov-,-Dec-" into mthStrings
-- i seemed to need to separate the routine here, running it with the
-- loops as shown didn't function as expected.
repeat for each item mth in mthStrings
  put makeOffsets(mth,textBlock) after varOffsets
end repeat
sort lines of varOffsets numeric

-- note that i added a third param in case i need to force the routine
-- to start elsewhere. it is set to 0 when i run this on the string in
-- question (which by the way is about 5000 chars long)
function makeOffsets mth,textBlock,posStart
  if posStart = empty then
    put 0 into pos
  else
    put posStart into pos
  end if
  repeat until varOffset = 0
    put offset(mth, textBlock, pos) into varOffset
    if varOffset > 0 and varOffset > posStart then
      if pos > 0 then
        put pos & return after newText
      end if
      add varoffset+length(mth)+1 to pos
    else
      exit repeat
    end if
  end repeat
  return newText
end makeOffsets

There is another routine that then does some manipulation on the returned offsets, since I need to put the return in BEFORE the date and, as luck would have it, the day part of the date (format is day-month-year) is not always two characters, so I had to add in a routine to check for numerics back from the offset position.

Here is the odd thing though. As far as I can see the script should work perfectly on a string without any delims and a bunch of dates in it. Oddly this is not the case. It mostly works (which means I've made a mistake or the file isn't quite as neat as I think it is) but gets thrown off and does not find offsets that it should. It does not seem to matter how long or short the record is, nor does it happen consistently in the same place. But it always happens.

I've looked for possible length errors (did I overshoot a record) but that does not seem possible or the whole thing would be broken. What happens is that, randomly it seems, some lines contain multiple records in a single string.

Thoughts greatly appreciated. I could (and probably will) write another routine for expediency to walk through the lines of the partially correct records to see if there is another date line item in it, but I have to say I am stumped as to how it could be skipping over some records and then finding them just fine after the error occurs. I checked for random oddball chars and confirmed that the dates not found are in fact properly formatted as x or xx-JAN-xx.

And oh yes, I am able to find the offset("-Nov-", fld 1) in the field that the resulting partially recovered list is placed in. So it does not appear to be an offset bug, not one that I can see anyway.
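One thing worth checking in a loop like the one above is how the skip parameter of offset() interacts with the running position. As the comments in Bryan's finished script point out, offset() with characters skipped reports a position relative to the skip point, so the prior position has to be added back. Here is a small Python model of that behaviour. `rev_offset` is a stand-in written for this note (it is neither a Rev nor a Python built-in), positions are 1-based as in Rev, and the sketch does not attempt to diagnose the script above.

```python
def rev_offset(needle, haystack, skip=0):
    """Model of offset() with a skip count: 1-based, counted from the
    skip point, 0 when the needle is not found."""
    found = haystack.find(needle, skip)
    return 0 if found < 0 else found - skip + 1

def all_offsets(needle, haystack):
    """Collect the absolute 1-based position of every occurrence."""
    positions, skip = [], 0
    while True:
        rel = rev_offset(needle, haystack, skip)
        if rel == 0:
            return positions
        positions.append(skip + rel)       # relative hit + chars skipped
        skip += rel + len(needle) - 1      # resume just past this hit

text = "1-Jan-02aaa15-Jan-03bbb"
print(all_offsets("-Jan-", text))
```

Recording each hit in the same pass that finds it, rather than on the following pass, also guarantees that the last occurrence in the string is not left out.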
Re: MatchText, MatchChunk and the needle in the haystack
Hi,

Not 100% sure, but should you start from 1? e.g.

put 0 into pos

should be:

put 1 into pos

All the Best
Dave

On 19 Mar 2007, at 17:24, Bryan McCormick wrote:

> -- note that i added a third param in case i need to force the routine to start elsewhere. it is set to 0 when i run this on the string in question (which by the way is about 5000 chars long)
> function makeOffsets mth,textBlock,posStart
>   if posStart = empty then
>     put 0 into pos
>   else
>     put posStart into pos
>   end if
Re: MatchText, MatchChunk and the needle in the haystack
On 3/19/07 10:49 AM, Bryan McCormick [EMAIL PROTECTED] wrote:

> Dave, Sadly it does not impact the outcome. Mind you I tried it just in case. I have played with all the vars that I can think of and it does nothing. It does not even appear to matter (as I thought) if there are multiple months (i.e. -Jan-) of the same type in a row (thought it might be finding the first and missing the others somehow, but no), it doesn't matter if the date is of xx-Month-xx, or x-Month-XX, nor does it matter the order or how often these appear in the string and it doesn't seem to matter how long or short the record or the file happens to be. It should work as far as I can see. I am stumped at this point. It is an error for sure (on my end) it is just really subtle it seems. Or it will be until someone points the magic finger and says here it is you idiot!

A couple ideas

-1--- make sure that you are not changing the length of the textBlock with replacements. This could accumulate to a significant offset error, depending on how you build your loops.

-2--- test for null chars [00 ascii] used in some file formats

  put length(textBlock) into origCharCnt
  replace null with empty in textBlock
  answer length(textBlock) - origCharCnt

-3--- do inspections to see if something is creating a false hit or false miss

  put the number of lines in textBlock into foundMth
  repeat for each month string, -Jan-, -Feb-, etc.
    replace "-Jan-" with cr & "Jan-" in textBlock
    get the number of lines in textBlock - sum(foundMth)
    put it & "," after foundMth
    breakpoint
    --now inspect the textBlock & foundMth for any odd occurrences
    -- optional is to filter textBlock without *Jan- thus purging as you go
  end repeat

Jim Ault
Las Vegas
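Jim's second and third checks can be combined into one non-destructive pass. A Python sketch under the same assumptions as the thread (NUL characters are the only junk tested for, month markers look like -Jan-); `inspect_block` is a hypothetical name:

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def inspect_block(text):
    """Strip NULs, then count how often each -Mon- marker occurs."""
    cleaned = text.replace("\x00", "")
    nulls_removed = len(text) - len(cleaned)
    counts = {m: cleaned.count("-" + m + "-") for m in MONTHS}
    return cleaned, nulls_removed, counts

raw = "06-Mar-92abc" + "\x00" + "02-Apr-92def7-Mar-93"
cleaned, nulls, counts = inspect_block(raw)
print(nulls, counts["Mar"], counts["Apr"], sum(counts.values()))
```

If the total of the counts differs from the number of records the splitting routine produces, the difference points at false hits or false misses to go looking for.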
Re: MatchText, MatchChunk and the needle in the haystack
On Mar 19, 2007, at 11:24 AM, Bryan McCormick wrote:

> Jim, Thanks for the script snippet. It didn't quite work as shown, but it did get me to think about the problem more carefully. I came up with this:
> put "-Jan-,-Feb-,-Mar-,-Apr-,-May-,-Jun-,-Jul-,-Aug-,-Sep-,-Oct-,-Nov-,-Dec-" into mthStrings

Bryan,

Is it possible that the original text string is not using hyphens consistently? Could there perhaps be en-dash and/or em-dash characters there, which look just like hyphens in monospaced fonts? If the original text was created in MS Word, for example, it often auto-substitutes en- or em-dashes for hyphens.

Just a thought.

Devin

Devin Asay
Humanities Technology and Research Support Center
Brigham Young University
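Devin's suspicion is quick to test. The Python sketch below assumes the text has already been decoded to Unicode (in a raw single-byte file the dash would be a platform-specific byte instead), and the table of look-alikes is a small, non-exhaustive selection; both function names are made up.

```python
# Unicode dashes that are easily mistaken for the ASCII hyphen-minus.
DASHES = {
    "\u2010": "-",  # hyphen
    "\u2011": "-",  # non-breaking hyphen
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
    "\u2212": "-",  # minus sign
}

def find_odd_dashes(text):
    """Return (index, character) for every look-alike dash."""
    return [(i, ch) for i, ch in enumerate(text) if ch in DASHES]

def normalize_dashes(text):
    """Replace every look-alike dash with a plain hyphen."""
    return text.translate(str.maketrans(DASHES))

sample = "06\u2013Mar\u201492text"
print(find_odd_dashes(sample))
print(normalize_dashes(sample))
```

Each replacement is one character for one character, so normalizing this way does not change the length of the text or shift any offsets.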
Re: MatchText, MatchChunk and the needle in the haystack
A simplistic approach would be to find the -mth- string and work from there on

untested

put "-Jan- -Feb- -Mar-" into mthStrings
repeat for each word MTH in mthStrings
  put 1 into pos
  repeat until pos = 0
    put offset(mth, textBlock, pos+2) into pos
    put cr into char pos - 2 of textBlock
  end repeat
end repeat
--now textBlock should have cr's in the right spots

end untested

Jim Ault
Las Vegas

On 3/18/07 4:12 PM, Bryan McCormick [EMAIL PROTECTED] wrote:

> Folks,
>
> I have been given a batch of text files that have had their delimiters stripped off (by accident) leaving a single string of text to parse back into record delimited form. And yes, of course, there is no back-up so it is the strings or nothing. I really know very little about using RegEx, but I presume this could at least in part solve the problem.
>
> Basically the only good news is that each record was originally delimited in the form of 24-Jan-02 so that as long as each date could be plucked out of the string it ought to be possible to grab the offset and then introduce a return before the next date occurrence. As in the text is
>
> 06-Mar-92therewasamangledbitoftexttodealwith02-Apr-92therest...
>
> I cannot seem to get the MatchText to work properly to identify these, but I guess really the problem is I still need to find an offset for each. Is MatchText even the right thing to use? Can I use it in conjunction with
>
> offset(MatchText(myVar,"[0-9]-(Jan|Feb|Mar...|Dec)-[0-9]",someVar))
>
> to find each one? Or is this a case where the string has to be brute forced?
>
> Any ideas on how to proceed? Any ideas, snippets would be greatly appreciated.
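Bryan's original question, whether a regular expression can hand back the offset of each date, has a direct answer in languages whose regex engine reports match positions. A Python sketch for comparison, using the month list and sample string from his post; `date_offsets` is a made-up name and the offsets are 0-based:

```python
import re

DATE = re.compile(
    r"\d{1,2}-(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{2}"
)

def date_offsets(text):
    """Return (start, matched_text) for each date in the string."""
    return [(m.start(), m.group(0)) for m in DATE.finditer(text)]

sample = "06-Mar-92therewasamangledbitoftexttodealwith02-Apr-92therest"
for start, date in date_offsets(sample):
    print(start, date)
```

Each start is where a return would go, and working through the list from the last entry to the first keeps the earlier offsets valid while the string grows.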