Re: [Tutor] need help generating table of contents
From: Tutor on behalf of Peter Otten <__pete...@web.de>
Sent: Monday, August 27, 2018 6:43 PM
To: tutor@python.org
Subject: Re: [Tutor] need help generating table of contents

> Albert-Jan Roskam wrote:
>
>> From: Tutor on behalf of Peter Otten <__pete...@web.de>
>> Sent: Friday, August 24, 2018 3:55 PM
>> To: tutor@python.org
>>
>>> The following reshuffle of your code seems to work:
>>>
>>> print('\r\n** Table of contents\r\n')
>>> pattern = '/Title \((.+?)\).+?/Page ([0-9]+)(?:\s+/Count ([0-9]+))?'
>>>
>>> def process(triples, limit=None, indent=0):
>>>     for index, (title, page, count) in enumerate(triples, 1):
>>>         title = indent * 4 * ' ' + title
>>>         print(title.ljust(79, ".") + page.zfill(2))
>>>         if count:
>>>             process(triples, limit=int(count), indent=indent+1)
>>>         if limit is not None and limit == index:
>>>             break
>>>
>>> process(iter(re.findall(pattern, toc, re.DOTALL)))
>>
>> Hi Peter, Cameron,
>>
>> Thanks for your replies! The code above indeed works as intended, but: I
>> don't really understand *why*. I would assign a name to the following line
>> "if limit is not None and limit == index", what would be the most
>> descriptive name? I often use "is_*" names for boolean variables. Would
>> "is_deepest_nesting_level" be a good name?
>
> No, it's not necessarily the deepest level. Every subsection eventually ends
> at this point; so you might call it
>
> reached_end_of_current_section
>
> Or just 'limit' ;)

LOL. Ok, now I get it :-)

> The None is only there for the outermost level where no /Count is provided.
> In this case the loop is exhausted.
>
> If you find it is easier to understand you can calculate the outer count aka
> limit as the number of matches - sum of counts:

>> Also, I don't understand why iter() is required here, and why finditer()
>> is not an alternative.
>
> finditer() would actually work -- I didn't use it because I wanted to make
> as few changes as possible to your code. What does not work is a list like
> the result of findall(). This is because the inner for loops (i.e. the ones
> in the nested calls of process) are supposed to continue the iteration
> instead of restarting it. A simple example to illustrate the difference:

Ah, the triples cannot be unpacked inside the "for" line of the loop. This works:

def process(triples, limit=None, indent=0):
    for index, triple in enumerate(triples, 1):
        title, page, count = triple.groups()  # unpack it here
        title = indent * 4 * ' ' + title
        print(title.ljust(79, ".") + page.zfill(2))
        if count:
            process(triples, limit=int(count), indent=indent+1)
        if limit is not None and limit == index:
            break

process(re.finditer(pattern, toc, re.DOTALL))

If I don't do this, I get this error:

  File "Q:/toc/toc.py", line 64, in <module>
    process(re.finditer(pattern, toc, re.DOTALL))
  File "Q:/Ctoc/toc.py", line 56, in process
    for index, (title, page, count) in enumerate(triples, 1):
TypeError: '_sre.SRE_Match' object is not iterable

Process finished with exit code 1

Thanks again Peter! Very insightful!

Albert-Jan

___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
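One possible way to keep the original tuple-unpacking "for" line while still using finditer() is to wrap the match objects in a small generator that yields their groups. The sketch below is only an illustration: iter_triples, toc_lines and SAMPLE_TOC are made-up names, the sample is a shortened stand-in for the real bookmarks file, and the lines are collected instead of printed so they can be inspected.

```python
import re

# Pattern from the thread, written as a raw string so the backslashes
# are passed to the regex engine unchanged.
PATTERN = r'/Title \((.+?)\).+?/Page ([0-9]+)(?:\s+/Count ([0-9]+))?'

# Hypothetical, shortened sample in the same pdfmark layout as the thread's data.
SAMPLE_TOC = """\
[ /Title (Intro)
  /Page 1
  /OUT pdfmark
[ /Title (Appendix)
  /Page 16
  /Count 2
  /OUT pdfmark
[ /Title (Sub1)
  /Page 17
  /OUT pdfmark
[ /Title (Sub2)
  /Page 40
  /OUT pdfmark
"""


def iter_triples(toc):
    """Lazily yield (title, page, count) tuples, one per bookmark."""
    for match in re.finditer(PATTERN, toc, re.DOTALL):
        # default='' mimics findall(), which returns '' for a missing /Count
        yield match.groups(default='')


def toc_lines(triples, limit=None, indent=0):
    """Same control flow as process() in the thread, but yields the lines."""
    for index, (title, page, count) in enumerate(triples, 1):
        yield (indent * 4 * ' ' + title).ljust(79, '.') + page.zfill(2)
        if count:
            # The nested call continues on the same shared iterator.
            yield from toc_lines(triples, limit=int(count), indent=indent + 1)
        if limit is not None and limit == index:
            break


lines = list(toc_lines(iter_triples(SAMPLE_TOC)))
for line in lines:
    print(line)
```

Because iter_triples() is a generator, it is already an iterator, so no extra iter() call is needed; the nested calls pick up where the outer loop left off.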
Re: [Tutor] need help generating table of contents
Albert-Jan Roskam wrote:

> From: Tutor on behalf of Peter Otten <__pete...@web.de>
> Sent: Friday, August 24, 2018 3:55 PM
> To: tutor@python.org
>
>> The following reshuffle of your code seems to work:
>>
>> print('\r\n** Table of contents\r\n')
>> pattern = '/Title \((.+?)\).+?/Page ([0-9]+)(?:\s+/Count ([0-9]+))?'
>>
>> def process(triples, limit=None, indent=0):
>>     for index, (title, page, count) in enumerate(triples, 1):
>>         title = indent * 4 * ' ' + title
>>         print(title.ljust(79, ".") + page.zfill(2))
>>         if count:
>>             process(triples, limit=int(count), indent=indent+1)
>>         if limit is not None and limit == index:
>>             break
>>
>> process(iter(re.findall(pattern, toc, re.DOTALL)))
>
> Hi Peter, Cameron,
>
> Thanks for your replies! The code above indeed works as intended, but: I
> don't really understand *why*. I would assign a name to the following line
> "if limit is not None and limit == index", what would be the most
> descriptive name? I often use "is_*" names for boolean variables. Would
> "is_deepest_nesting_level" be a good name?

No, it's not necessarily the deepest level. Every subsection eventually ends
at this point; so you might call it

reached_end_of_current_section

Or just 'limit' ;)

The None is only there for the outermost level where no /Count is provided.
In this case the loop is exhausted.

If you find it is easier to understand you can calculate the outer count aka
limit as the number of matches - sum of counts:

def process(triples, section_length, indent=0):
    for index, (title, page, count) in enumerate(triples, 1):
        title = indent * 4 * ' ' + title
        print(title.ljust(79, ".") + page.zfill(2))
        if count:
            process(triples, section_length=int(count), indent=indent+1)
        if section_length == index:
            break

triples = re.findall(pattern, toc, re.DOTALL)
toplevel_section_length = (
    len(triples)
    - sum(int(c or 0) for t, p, c in triples)
)
process(iter(triples), toplevel_section_length)

Just for fun here's one last variant that does away with the break -- and
thus the naming issue -- completely:

import itertools

def process(triples, limit=None, indent=0):
    for title, page, count in itertools.islice(triples, limit):
        title = indent * 4 * ' ' + title
        print(title.ljust(79, ".") + page.zfill(2))
        if count:
            process(triples, limit=int(count), indent=indent+1)

Note that islice(items, None) does the right thing:

>>> from itertools import islice
>>> list(islice("abc", None))
['a', 'b', 'c']

> Also, I don't understand why iter() is required here, and why finditer()
> is not an alternative.

finditer() would actually work -- I didn't use it because I wanted to make
as few changes as possible to your code. What does not work is a list like
the result of findall(). This is because the inner for loops (i.e. the ones
in the nested calls of process) are supposed to continue the iteration
instead of restarting it. A simple example to illustrate the difference:

>>> s = "abcdefg"
>>> for k in range(3):
...     print("===", k, "===")
...     for i, v in enumerate(s):
...         print(v)
...         if i == 2: break
...
=== 0 ===
a
b
c
=== 1 ===
a
b
c
=== 2 ===
a
b
c
>>> s = iter("abcdefg")
>>> for k in range(3):
...     print("===", k, "===")
...     for i, v in enumerate(s):
...         print(v)
...         if i == 2: break
...
=== 0 ===
a
b
c
=== 1 ===
d
e
f
=== 2 ===
g
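The same restart-versus-continue behaviour applies to the islice() variant. The following minimal sketch (take_in_chunks is a made-up helper, not part of the thread's code) calls islice() repeatedly on a plain string and on an iterator over the same string:

```python
from itertools import islice


def take_in_chunks(source, size, chunks):
    """Call islice(source, size) `chunks` times and collect each result."""
    return [list(islice(source, size)) for _ in range(chunks)]


letters = "abcdefg"

# A string (like a list) is iterated from the start on every islice() call.
restarted = take_in_chunks(letters, 3, 3)

# An iterator remembers its position, so each call continues where the
# previous one stopped; islice() does not consume more than `size` items.
continued = take_in_chunks(iter(letters), 3, 3)

print(restarted)  # [['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']]
print(continued)  # [['a', 'b', 'c'], ['d', 'e', 'f'], ['g']]
```

This is why the recursive process() needs iter(findall(...)) or finditer(): each nested call must consume entries from the one shared stream, not start over at the first bookmark.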
Re: [Tutor] need help generating table of contents
From: Tutor on behalf of Peter Otten <__pete...@web.de>
Sent: Friday, August 24, 2018 3:55 PM
To: tutor@python.org

> The following reshuffle of your code seems to work:
>
> print('\r\n** Table of contents\r\n')
> pattern = '/Title \((.+?)\).+?/Page ([0-9]+)(?:\s+/Count ([0-9]+))?'
>
> def process(triples, limit=None, indent=0):
>     for index, (title, page, count) in enumerate(triples, 1):
>         title = indent * 4 * ' ' + title
>         print(title.ljust(79, ".") + page.zfill(2))
>         if count:
>             process(triples, limit=int(count), indent=indent+1)
>         if limit is not None and limit == index:
>             break
>
> process(iter(re.findall(pattern, toc, re.DOTALL)))

Hi Peter, Cameron,

Thanks for your replies! The code above indeed works as intended, but: I
don't really understand *why*. I would assign a name to the following line
"if limit is not None and limit == index", what would be the most
descriptive name? I often use "is_*" names for boolean variables. Would
"is_deepest_nesting_level" be a good name?

Also, I don't understand why iter() is required here, and why finditer()
is not an alternative.

I wrote the bookmarks file myself, and the code above is part of a shell
script that compiles a large .pdf, with openoffice commandline calls,
ghostscript, git, pdftk and python. The human-readable toc and the pdf
bookmarks will always be consistent if I only need to edit one file.

Thanks again!

Albert-Jan
Re: [Tutor] need help generating table of contents
On 24Aug2018 17:55, Peter Otten <__pete...@web.de> wrote:
> Albert-Jan Roskam wrote:
>> I have Ghostscript files with a table of contents (toc) and I would like
>> to use this info to generate a human-readable toc. The problem is: I
>> can't get the (nested) hierarchy right.
>>
>> import re
>>
>> toc = """\
>> [ /PageMode /UseOutlines
>>   /Page 1
>>   /View [/XYZ null null 0]
>>   /DOCVIEW pdfmark
>> [ /Title (Title page)
>>   /Page 1
>>   /View [/XYZ null null 0]
>>   /OUT pdfmark
>> [ /Title (Document information)
>>   /Page 2
>>   /View [/XYZ null null 0]
>>   /OUT pdfmark
> [...]
>> What is the best approach to do this?
>
> The best approach is probably to use some tool/library that understands
> postscript.

Just to this: I disagree.

IIRC, there's no such thing as '/Title' etc in PostScript - these will all
be PostScript functions defined by whatever made the document. So a generic
tool won't have any way to extract semantics like titles from a document.

The OP presumably has the specific output of a particular tool with this
nice well structured postscript, so he needs to write his/her own special
parser.

Cheers,
Cameron Simpson
Re: [Tutor] need help generating table of contents
Albert-Jan Roskam wrote:

> Hello,
>
> I have Ghostscript files with a table of contents (toc) and I would like to
> use this info to generate a human-readable toc. The problem is: I can't get
> the (nested) hierarchy right.
>
> import re
>
> toc = """\
> [ /PageMode /UseOutlines
>   /Page 1
>   /View [/XYZ null null 0]
>   /DOCVIEW pdfmark
> [ /Title (Title page)
>   /Page 1
>   /View [/XYZ null null 0]
>   /OUT pdfmark
> [ /Title (Document information)
>   /Page 2
>   /View [/XYZ null null 0]
>   /OUT pdfmark
> [ /Title (Blah)
>   /Page 3
>   /View [/XYZ null null 0]
>   /OUT pdfmark
> [ /Title (Appendix)
>   /Page 16
>   /Count 4
>   /View [/XYZ null null 0]
>   /OUT pdfmark
> [ /Title (Sub1)
>   /Page 17
>   /Count 4
>   /OUT pdfmark
> [ /Title (Subsub1)
>   /Page 17
>   /OUT pdfmark
> [ /Title (Subsub2)
>   /Page 18
>   /OUT pdfmark
> [ /Title (Subsub3)
>   /Page 29
>   /OUT pdfmark
> [ /Title (Subsub4)
>   /Page 37
>   /OUT pdfmark
> [ /Title (Sub2)
>   /Page 40
>   /OUT pdfmark
> [ /Title (Sub3)
>   /Page 49
>   /OUT pdfmark
> [ /Title (Sub4)
>   /Page 56
>   /OUT pdfmark
> """
> print('\r\n** Table of contents\r\n')
> pattern = '/Title \((.+?)\).+?/Page ([0-9]+)(?:\s+/Count ([0-9]+))?'
> indent = 0
> start = True
> for title, page, count in re.findall(pattern, toc, re.DOTALL):
>     title = (indent * ' ') + title
>     count = int(count or 0)
>     print(title.ljust(79, ".") + page.zfill(2))
>     if count:
>         count -= 1
>         start = True
>     if count and start:
>         indent += 2
>         start = False
>     if not count and not start:
>         indent -= 2
>         start = True
>
> This generates the following TOC, with subsub2 to subsub4 dedented one level
> too much:
>
> What is the best approach to do this?

The best approach is probably to use some tool/library that understands
postscript. However, your immediate problem is that when there is more than
one level of indentation you only keep track of the "count" of the innermost
level. You can either use a list of counts or use recursion and rely on the
stack to remember the counts of the outer levels.

The following reshuffle of your code seems to work:

print('\r\n** Table of contents\r\n')
pattern = '/Title \((.+?)\).+?/Page ([0-9]+)(?:\s+/Count ([0-9]+))?'

def process(triples, limit=None, indent=0):
    for index, (title, page, count) in enumerate(triples, 1):
        title = indent * 4 * ' ' + title
        print(title.ljust(79, ".") + page.zfill(2))
        if count:
            process(triples, limit=int(count), indent=indent+1)
        if limit is not None and limit == index:
            break

process(iter(re.findall(pattern, toc, re.DOTALL)))
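For comparison, the other option mentioned above, a list of counts instead of recursion, might look roughly like the sketch below. It is an illustration under assumptions, not code from the thread: toc_lines, remaining and SAMPLE_TOC are made-up names, the sample is a shortened stand-in for the real data, and the lines are returned rather than printed.

```python
import re

# Pattern from the thread, written as a raw string.
PATTERN = r'/Title \((.+?)\).+?/Page ([0-9]+)(?:\s+/Count ([0-9]+))?'

# Hypothetical, shortened sample with two nesting levels.
SAMPLE_TOC = """\
[ /Title (Intro)
  /Page 1
  /OUT pdfmark
[ /Title (Appendix)
  /Page 16
  /Count 2
  /OUT pdfmark
[ /Title (Sub1)
  /Page 17
  /Count 2
  /OUT pdfmark
[ /Title (Subsub1)
  /Page 17
  /OUT pdfmark
[ /Title (Subsub2)
  /Page 18
  /OUT pdfmark
[ /Title (Sub2)
  /Page 40
  /OUT pdfmark
"""


def toc_lines(toc, indent_width=4):
    """Format the TOC, tracking open sections with an explicit list of counts."""
    remaining = []  # one entry per open section: children still expected
    lines = []
    for title, page, count in re.findall(PATTERN, toc, re.DOTALL):
        # The nesting depth is simply the number of sections still open.
        indent = len(remaining) * indent_width
        lines.append((indent * ' ' + title).ljust(79, '.') + page.zfill(2))
        if remaining:
            remaining[-1] -= 1            # this entry fills one slot of its parent
        if count:
            remaining.append(int(count))  # this entry opens a deeper section
        # Close every section whose children have all been seen.
        while remaining and remaining[-1] == 0:
            remaining.pop()
    return lines


lines = toc_lines(SAMPLE_TOC)
for line in lines:
    print(line)
```

Here a plain findall() list is sufficient, because there is only one loop and nothing needs to resume a partially consumed iteration.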
[Tutor] need help generating table of contents
Hello,

I have Ghostscript files with a table of contents (toc) and I would like to
use this info to generate a human-readable toc. The problem is: I can't get
the (nested) hierarchy right.

import re

toc = """\
[ /PageMode /UseOutlines
  /Page 1
  /View [/XYZ null null 0]
  /DOCVIEW pdfmark
[ /Title (Title page)
  /Page 1
  /View [/XYZ null null 0]
  /OUT pdfmark
[ /Title (Document information)
  /Page 2
  /View [/XYZ null null 0]
  /OUT pdfmark
[ /Title (Blah)
  /Page 3
  /View [/XYZ null null 0]
  /OUT pdfmark
[ /Title (Appendix)
  /Page 16
  /Count 4
  /View [/XYZ null null 0]
  /OUT pdfmark
[ /Title (Sub1)
  /Page 17
  /Count 4
  /OUT pdfmark
[ /Title (Subsub1)
  /Page 17
  /OUT pdfmark
[ /Title (Subsub2)
  /Page 18
  /OUT pdfmark
[ /Title (Subsub3)
  /Page 29
  /OUT pdfmark
[ /Title (Subsub4)
  /Page 37
  /OUT pdfmark
[ /Title (Sub2)
  /Page 40
  /OUT pdfmark
[ /Title (Sub3)
  /Page 49
  /OUT pdfmark
[ /Title (Sub4)
  /Page 56
  /OUT pdfmark
"""
print('\r\n** Table of contents\r\n')
pattern = '/Title \((.+?)\).+?/Page ([0-9]+)(?:\s+/Count ([0-9]+))?'
indent = 0
start = True
for title, page, count in re.findall(pattern, toc, re.DOTALL):
    title = (indent * ' ') + title
    count = int(count or 0)
    print(title.ljust(79, ".") + page.zfill(2))
    if count:
        count -= 1
        start = True
    if count and start:
        indent += 2
        start = False
    if not count and not start:
        indent -= 2
        start = True

This generates the following TOC, with subsub2 to subsub4 dedented one level
too much:

** Table of contents

Title page.....................................................................01
Document information...........................................................02
Blah...........................................................................03
Appendix.......................................................................16
  Sub1.........................................................................17
    Subsub1....................................................................17
  Subsub2......................................................................18
  Subsub3......................................................................29
  Subsub4......................................................................37
  Sub2.........................................................................40
  Sub3.........................................................................49
  Sub4.........................................................................56

What is the best approach to do this?

Thanks in advance!

Albert-Jan