Re: Suggestions for how to approach this problem?
James Stroud wrote:
> import re
> records = []
> record = None
> counter = 1
> regex = re.compile(r'^(\d+)\. (.*)')
> for aline in lines:
>     m = regex.search(aline)
>     if m is not None:
>         recnum, aline = m.groups()
>         if int(recnum) == counter:
>             if record is not None:
>                 records.append(record)
>             record = [aline.strip()]
>             counter += 1
>             continue
>     record.append(aline.strip())
>
> if record is not None:
>     records.append(record)
>
> records = [" ".join(r) for r in records]

What do I need to do to get this to run against the text that I have? Is 'lines' meant to be a list of the lines from the original citation file?

-- http://mail.python.org/mailman/listinfo/python-list
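One way to get a `lines` list for the quoted loop is to read the citation file and split it on line breaks. This is only a sketch, assuming Python 3 and a UTF-8 text file; the helper name `read_lines` and the filename `citations.txt` are made up for illustration and do not come from the thread.

```python
from pathlib import Path


def read_lines(path, encoding="utf-8"):
    """Return the file's contents as a list of lines, without line endings."""
    # splitlines() drops the trailing "\n" (or "\r\n") from every line.
    return Path(path).read_text(encoding=encoding).splitlines()


# Hypothetical usage; "citations.txt" is an assumed filename:
# lines = read_lines("citations.txt")
```

Any iterable of strings should work as `lines`, since the loop strips each line before storing it.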
Re: Suggestions for how to approach this problem?
James Stroud wrote:
> I included code in my previous post that will parse the entire bib,
> making use of the numbering and eliminating the most probable, but still
> fairly rare, potential ambiguity. You might want to check out that code,
> as my testing it showed that it worked with your example.

Thanks. It looked a little involved so I hadn't started to work through it yet, but I'll do that now before I actually try to write something from scratch. :)
Re: Suggestions for how to approach this problem?
John Salerno wrote:
> John Salerno wrote:
>
>> So I need to remove the line breaks too, but of course not *all* of
>> them because each reference still needs a line break between it.
>
> After doing a bit of search and replace for tabs with my text editor, I
> think I've narrowed down the problem to just this:
>
> I need to remove all newline characters that are not at the end of a
> citation (and replace them with a single space). That is, those that are
> not followed by the start of a new numbered citation. This seems to
> involve a look-ahead RE, but I'm not sure how to write those. This is
> what I came up with:
>
> \n(?=(\d)+)
>
> (I can never remember if I need parentheses around '\d' or if the +
> should be inside it or not!)

I included code in my previous post that will parse the entire bib, making use of the numbering and eliminating the most probable, but still fairly rare, potential ambiguity. You might want to check out that code, as my testing it showed that it worked with your example.

James
Re: Suggestions for how to approach this problem?
John Salerno wrote:
> So I need to remove the line breaks too, but of course not *all* of them
> because each reference still needs a line break between it.

After doing a bit of search and replace for tabs with my text editor, I think I've narrowed down the problem to just this:

I need to remove all newline characters that are not at the end of a citation (and replace them with a single space). That is, those that are not followed by the start of a new numbered citation. This seems to involve a look-ahead RE, but I'm not sure how to write those. This is what I came up with:

\n(?=(\d)+)

(I can never remember if I need parentheses around '\d' or if the + should be inside it or not!)
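A note on the pattern above: `(?=...)` is a positive look-ahead, so it matches the newlines that are followed by a digit, which are the ones to keep. Removing the others calls for a negative look-ahead, `(?!...)`, and `\d+` needs no parentheses. The following is a minimal sketch, assuming every citation starts with digits, a period and whitespace; the name `join_wrapped_lines` is made up for illustration.

```python
import re

# A newline that is NOT followed by "<digits>.<whitespace>" sits inside a
# citation. The \Z alternative leaves a final trailing newline alone.
INNER_NEWLINE = re.compile(r"\n(?!\d+\.\s|\Z)")


def join_wrapped_lines(text):
    """Replace line breaks inside a citation with a single space."""
    return INNER_NEWLINE.sub(" ", text)


sample = "1.  First cite,\ncontinued here.\n2.  Second cite.\n"
print(join_wrapped_lines(sample))
```

This still misreads a continuation line that happens to begin with something like "34. ", so it is only as reliable as the assumption that such lines do not occur.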
Re: Suggestions for how to approach this problem?
James Stroud wrote:
> If you can count on the person not skipping any numbers in the
> citations, you can take an "AI" approach to hopefully weed out the rare
> circumstance that a number followed by a period starts a line in the
> middle of the citation.

I don't think any numbers are skipped, but there are some cases where a number is followed by a period within a citation. But this might not matter since each reference number begins at the start of the line, so I could use the RE to start at the beginning.
Re: Suggestions for how to approach this problem?
Dave Hansen wrote:
> Questions:
>
> 1) Do the citation numbers always begin in column 1?

Yes, that's one consistency at least. :)

> 2) Are the citation numbers always followed by a period and then at
> least one whitespace character?

Yes, it seems to be either one or two whitespace characters.

> find the beginning of each cite. then I would output each cite
> through a state machine that would reduce consecutive whitespace
> characters (space, tab, newline) into a single character, separating
> each cite with a newline.

Interesting idea! I'm not sure what a "state machine" is, but it sounds like you are suggesting that I more or less separate each reference, process it, and then rewrite it to a new file in the cleaner format? That might work pretty well.
Re: Suggestions for how to approach this problem?
Necmettin Begiter wrote:
> Is this what the text looks like:
>
> 123
> some information
>
> 124 some other information
>
> 126(tab here)something else
>
> If this is the case (the numbers are at the beginning, and after the numbers
> there is either a newline or a tab), the logic might be this simple:

They all seem to be a little different. One consistency is that each number is followed by two spaces. There is nothing separating each reference except a single newline, which I want to preserve. But within each reference there might be a combination of spaces, tabs, or newlines.
Re: Suggestions for how to approach this problem?
John Salerno wrote:
> Marc 'BlackJack' Rintsch wrote:
> Here's what it looks like now:
>
> 1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray irradiated
> bacteriophage T2. J. Bacteriol. 87:1330-1338.
> 2. Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R factor. Lancet 2:1138.
> 3. Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic resistance factors in
> Enterobacteriaceae. 34. The specific effects of the inhibitors of DNA synthesis on the
> transfer of R factor and F factor. Med. Biol. (Tokyo) 73:79-83.
> 4. Levy, S.B. (1967) Blood safari into Kenya. The New Physician 16:50-54.
> 5. Levy, S.B., W.T. Fitts and J.B. Leach (1967) Surgical treatment of diverticular disease of the
> colon: Evaluation of an eleven-year period. Annals Surg. 166:947-955.
>
> As you can see, any single citation is broken over several lines as a
> result of a line break. I want it to look like this:
>
> 1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray irradiated bacteriophage T2. J. Bacteriol. 87:1330-1338.
> 2. Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R factor. Lancet 2:1138.
> 3. Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic resistance factors in Enterobacteriaceae. 34. The specific effects of the inhibitors of DNA synthesis on the transfer of R factor and F factor. Med. Biol. (Tokyo) 73:79-83.
> 4. Levy, S.B. (1967) Blood safari into Kenya. The New Physician 16:50-54.
> 5. Levy, S.B., W.T. Fitts and J.B. Leach (1967) Surgical treatment of diverticular disease of the colon: Evaluation of an eleven-year period. Annals Surg. 166:947-955.
>
> Now, since this is pasted, it might not even look good to you. But in
> the second example, the numbers are meant to be bullets and so the
> indentation would happen automatically (in Word). But for now they are
> just typed.

If you can count on the person not skipping any numbers in the citations, you can take an "AI" approach to hopefully weed out the rare circumstance that a number followed by a period starts a line in the middle of the citation. This is not failsafe, say if you were on citation 33 and it was in chapter 34 and that 34 happened to start a new line. But, then again, even a human would take a little time to figure that one out--and probably wouldn't be 100% accurate either. I'm sure there is an AI word for the type of parser that could parse something like this unambiguously and I'm sure that it has been proven to be impossible to create:

import re

records = []
record = None
counter = 1

# A line that starts with a number, a period and a space
regex = re.compile(r'^(\d+)\. (.*)')

for aline in lines:
    m = regex.search(aline)
    if m is not None:
        recnum, aline = m.groups()
        # Only the next expected number starts a new record
        if int(recnum) == counter:
            if record is not None:
                records.append(record)
            record = [aline.strip()]
            counter += 1
            continue
    record.append(aline.strip())

if record is not None:
    records.append(record)

records = [" ".join(r) for r in records]

py> records
['Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray irradiated bacteriophage T2. J. Bacteriol. 87:1330-1338.',
 'Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R factor. Lancet 2:1138.',
 'Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic resistance factors in Enterobacteriaceae. 34. The specific effects of the inhibitors of DNA synthesis on the transfer of R factor and F factor. Med. Biol. (Tokyo) 73:79-83.',
 'Levy, S.B. (1967) Blood safari into Kenya. The New Physician 16:50-54.',
 'Levy, S.B., W.T. Fitts and J.B. Leach (1967) Surgical treatment of diverticular disease of the colon: Evaluation of an eleven-year period. Annals Surg. 166:947-955.']

James
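Two details of the loop as posted may be worth guarding against. `aline` is rebound to the text after the number before the counter is checked, so a continuation line that begins with, say, "34. " would lose its number. Input that does not begin with citation 1 would also reach `record.append` while `record` is still None. The variant below is a sketch of the same counter idea wrapped in a function; `join_citations` and `CITE_START` are made-up names, and it assumes the numbering starts at 1 with no gaps.

```python
import re

# Number, period, then at least one whitespace character (an assumption;
# the original pattern required exactly one space).
CITE_START = re.compile(r"^(\d+)\.\s+(.*)")


def join_citations(lines):
    """Group physical lines into one string per numbered citation."""
    records = []
    expected = 1  # next citation number that may start a new record
    for line in lines:
        m = CITE_START.match(line)
        if m and int(m.group(1)) == expected:
            # Start a new record with the text after the number.
            records.append([m.group(2).strip()])
            expected += 1
        elif records and line.strip():
            # Continuation line: keep it whole, including any leading "34.".
            records[-1].append(line.strip())
        # Lines before the first citation, and blank lines, are skipped.
    return [" ".join(parts) for parts in records]
```

As in the original, a continuation line that starts with exactly the next expected number would still be mistaken for a new citation.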
Re: Suggestions for how to approach this problem?
On May 8, 3:00 pm, John Salerno <[EMAIL PROTECTED]> wrote:
> Marc 'BlackJack' Rintsch wrote:
> > I think I have a vague idea of what the input looks like, but it would be
> > helpful if you show some example input and wanted output.
>
> Good idea. Here's what it looks like now:
>
> 1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray irradiated
> bacteriophage T2. J. Bacteriol. 87:1330-1338.
> 2. Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R factor. Lancet 2:1138.
> 3. Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic resistance factors in
> Enterobacteriaceae. 34. The specific effects of the inhibitors of DNA synthesis on the
> transfer of R factor and F factor. Med. Biol. (Tokyo) 73:79-83.

Questions:

1) Do the citation numbers always begin in column 1?

2) Are the citation numbers always followed by a period and then at least one whitespace character?

If so, I'd probably use a regular expression like

^[0-9]+\.[ \t]

to find the beginning of each cite. Then I would output each cite through a state machine that would reduce consecutive whitespace characters (space, tab, newline) into a single character, separating each cite with a newline.

Final formatting can be done with paragraph styles in Word.

HTH,
-=Dave
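The approach described above could be sketched roughly as follows. The regular expression is the one from the message, used with `re.MULTILINE` so that `^` matches at every line start; the function names `collapse_whitespace` and `normalize` are made up for illustration, and the two-state loop is just one possible reading of "state machine" here.

```python
import re

# Start of a cite: digits, a period, then a space or tab, at a line start.
CITE_START = re.compile(r"^[0-9]+\.[ \t]", re.MULTILINE)


def collapse_whitespace(text):
    """Two-state machine: either copying text or skipping a whitespace run."""
    out = []
    in_space = False
    for ch in text:
        if ch in " \t\r\n":
            in_space = True  # remember the run, emit nothing yet
        else:
            if in_space and out:
                out.append(" ")  # one space stands in for the whole run
            in_space = False
            out.append(ch)
    return "".join(out)  # leading and trailing whitespace is dropped


def normalize(text):
    """Return one cite per line, with internal whitespace collapsed."""
    starts = [m.start() for m in CITE_START.finditer(text)]
    bounds = starts + [len(text)]
    # Anything before the first cite is ignored.
    cites = [text[a:b] for a, b in zip(bounds, bounds[1:])]
    return "\n".join(collapse_whitespace(c) for c in cites)
```

Because the split relies only on the pattern, a continuation line that starts with a number and a period would be treated as a new cite.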
Re: Suggestions for how to approach this problem?
On Tuesday 08 May 2007 22:23:31 John Salerno wrote:
> John Salerno wrote:
> > typed, there are often line breaks at the end of each line
>
> Also, there are sometimes tabs used to indent the subsequent lines of
> citation, but I assume with that I can just replace the tab with a space.

Is this what the text looks like:

123
some information

124 some other information

126(tab here)something else

If this is the case (the numbers are at the beginning, and after the numbers there is either a newline or a tab), the logic might be this simple: get the numbers at the beginning of the line. Check for \n and \t after the number; if either exists, remove it or replace it with a space or whatever you prefer, and there you have it.

Also, how are the records separated? By empty lines? If so, \n\n is an empty line in a string, like this:

"""
some text here\n
\n
some other text here\n
"""
Re: Suggestions for how to approach this problem?
Marc 'BlackJack' Rintsch wrote:
> I think I have a vague idea of what the input looks like, but it would be
> helpful if you show some example input and wanted output.

Good idea. Here's what it looks like now:

1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray irradiated
bacteriophage T2. J. Bacteriol. 87:1330-1338.
2. Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R factor. Lancet 2:1138.
3. Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic resistance factors in
Enterobacteriaceae. 34. The specific effects of the inhibitors of DNA synthesis on the
transfer of R factor and F factor. Med. Biol. (Tokyo) 73:79-83.
4. Levy, S.B. (1967) Blood safari into Kenya. The New Physician 16:50-54.
5. Levy, S.B., W.T. Fitts and J.B. Leach (1967) Surgical treatment of diverticular disease of the
colon: Evaluation of an eleven-year period. Annals Surg. 166:947-955.

As you can see, any single citation is broken over several lines as a result of a line break. I want it to look like this:

1. Levy, S.B. (1964) Isologous interference with ultraviolet and X-ray irradiated bacteriophage T2. J. Bacteriol. 87:1330-1338.
2. Levy, S.B. and T. Watanabe (1966) Mepacrine and transfer of R factor. Lancet 2:1138.
3. Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966) Episomic resistance factors in Enterobacteriaceae. 34. The specific effects of the inhibitors of DNA synthesis on the transfer of R factor and F factor. Med. Biol. (Tokyo) 73:79-83.
4. Levy, S.B. (1967) Blood safari into Kenya. The New Physician 16:50-54.
5. Levy, S.B., W.T. Fitts and J.B. Leach (1967) Surgical treatment of diverticular disease of the colon: Evaluation of an eleven-year period. Annals Surg. 166:947-955.

Now, since this is pasted, it might not even look good to you. But in the second example, the numbers are meant to be bullets and so the indentation would happen automatically (in Word). But for now they are just typed.
Re: Suggestions for how to approach this problem?
In <[EMAIL PROTECTED]>, John Salerno wrote:
> I have a large list of publication citations that are numbered. The
> numbers are simply typed in with the rest of the text. What I want to do
> is remove the numbers and then put bullets instead. Now, this alone
> would be easy enough, with a little Python and a little work by hand,
> but the real issue is that because of the way these citations were
> typed, there are often line breaks at the end of each line -- in other
> words, the person didn't just let the line flow to the next line, they
> manually pressed Enter. So inserting bullets at this point would put a
> bullet at each line break.
>
> So I need to remove the line breaks too, but of course not *all* of them
> because each reference still needs a line break between it. So I'm
> hoping I could get an idea or two for approaching this. I figure regular
> expressions will be needed, and maybe it would be good to remove the
> line breaks first and *not* remove a line break that comes before the
> numbers (because that would be the proper place for one), and then
> finally remove the numbers.

I think I have a vague idea of what the input looks like, but it would be helpful if you show some example input and wanted output.

Ciao,
Marc 'BlackJack' Rintsch
Re: Suggestions for how to approach this problem?
John Salerno wrote:
> typed, there are often line breaks at the end of each line

Also, there are sometimes tabs used to indent the subsequent lines of citation, but I assume with that I can just replace the tab with a space.
Suggestions for how to approach this problem?
I figured I might give myself a little project to make my life at work easier, so here's what I want to do:

I have a large list of publication citations that are numbered. The numbers are simply typed in with the rest of the text. What I want to do is remove the numbers and then put bullets instead. Now, this alone would be easy enough, with a little Python and a little work by hand, but the real issue is that because of the way these citations were typed, there are often line breaks at the end of each line -- in other words, the person didn't just let the line flow to the next line, they manually pressed Enter. So inserting bullets at this point would put a bullet at each line break.

So I need to remove the line breaks too, but of course not *all* of them because each reference still needs a line break between it. So I'm hoping I could get an idea or two for approaching this. I figure regular expressions will be needed, and maybe it would be good to remove the line breaks first and *not* remove a line break that comes before the numbers (because that would be the proper place for one), and then finally remove the numbers.

Thanks.
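The last step of that plan, swapping the typed numbers for bullets once each citation sits on its own line, might look something like this. It is a sketch only: `numbers_to_bullets` is a made-up name, the "*" is a plain-text stand-in for a real Word bullet, and the text is assumed to have had its inner line breaks removed already.

```python
import re

# Digits, a period and the following spaces or tabs, only at a line start.
LEADING_NUMBER = re.compile(r"^\d+\.[ \t]+", re.MULTILINE)


def numbers_to_bullets(text, bullet="*"):
    """Replace each leading citation number with a bullet character."""
    # A function is used as the replacement so that the bullet string is
    # inserted literally, whatever characters it contains.
    return LEADING_NUMBER.sub(lambda m: bullet + " ", text)


joined = "1.  A cite.\n2. Another 34. cite.\n"
print(numbers_to_bullets(joined))
```

Numbers in the middle of a line, such as the "34." above, are left alone because the pattern is anchored to the start of a line.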