Re: [Tutor] need help generating table of contents

2018-08-28 Thread Albert-Jan Roskam
From: Tutor  on behalf of 
Peter Otten <__pete...@web.de>
Sent: Monday, August 27, 2018 6:43 PM
To: tutor@python.org
Subject: Re: [Tutor] need help generating table of contents
  

Albert-Jan Roskam wrote:

> 
> From: Tutor  on behalf
> of Peter Otten <__pete...@web.de> Sent: Friday, August 24, 2018 3:55 PM
> To: tutor@python.org
> 
>> The following reshuffle of your code seems to work:
>> 
>> print('\r\n** Table of contents\r\n')
>> pattern = '/Title \((.+?)\).+?/Page ([0-9]+)(?:\s+/Count ([0-9]+))?'
>> 
>> def process(triples, limit=None, indent=0):
>>     for index, (title, page, count) in enumerate(triples, 1):
>>         title = indent * 4 * ' ' + title
>>         print(title.ljust(79, ".") + page.zfill(2))
>>         if count:
>>             process(triples, limit=int(count), indent=indent+1)
>>         if limit is not None and limit == index:
>>             break
>> 
>> process(iter(re.findall(pattern, toc, re.DOTALL)))
> 
> Hi Peter, Cameron,
> 
> Thanks for your replies! The code above indeed works as intended, but: I
> don't really understand *why*. I would assign a name to the following line
> "if limit is not None and limit == index", what would be the most
> descriptive name? I often use "is_*" names for boolean variables. Would
> "is_deepest_nesting_level" be a good name?



> No, it's not necessarily the deepest level. Every subsection eventually ends 
> at this point; so you might call it reached_end_of_current_section
> 
> Or just 'limit' ;) 

LOL. Ok, now I get it :-)

> The None is only there for the outermost level where no /Count is provided. 
> In this case the loop is exhausted.
> 
> If you find it is easier to understand you can calculate the outer count aka 
> limit as the number of matches - sum of counts:
> 



>> Also, I don't understand why iter() is required here, and why finditer()
>> is not an alternative.

> finditer() would actually work -- I didn't use it because I wanted to make 
> as few changes as possible to your code. What does not work is a list like 
> the result of findall(). This is because the inner for loops (i.e. the ones 
> in the nested calls of process) are supposed to continue the iteration 
> instead of restarting it. A simple example to illustrate the difference:

Ah, with finditer() the items are match objects rather than tuples, so they 
cannot be unpacked inside the "for" line of the loop. This works:

def process(triples, limit=None, indent=0):
    for index, triple in enumerate(triples, 1):
        title, page, count = triple.groups()  # unpack it here
        title = indent * 4 * ' ' + title
        print(title.ljust(79, ".") + page.zfill(2))
        if count:
            process(triples, limit=int(count), indent=indent+1)
        if limit is not None and limit == index:
            break

process(re.finditer(pattern, toc, re.DOTALL))


If I don't do this, I get this error:
  File "Q:/toc/toc.py", line 64, in <module>
    process(re.finditer(pattern, toc, re.DOTALL))
  File "Q:/Ctoc/toc.py", line 56, in process
    for index, (title, page, count) in enumerate(triples, 1):
TypeError: '_sre.SRE_Match' object is not iterable

Process finished with exit code 1
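
A possible alternative, sketched here rather than taken from the thread: keep the tuple unpacking in the "for" line and convert each match object to a tuple before it reaches process(). The helper name iter_triples and the shortened toc sample are made up for illustration, and this process() collects its lines in a list instead of printing them directly.

```python
import re

# Shortened, made-up sample in the same pdfmark layout as the bookmarks file.
toc = """\
[ /Title (Intro)
  /Page 1
  /OUT pdfmark
[ /Title (Appendix)
  /Page 16
  /Count 2
  /OUT pdfmark
[ /Title (Sub1)
  /Page 17
  /OUT pdfmark
[ /Title (Sub2)
  /Page 40
  /OUT pdfmark
"""
# Raw string, so the backslashes reach the regex engine unchanged.
pattern = r'/Title \((.+?)\).+?/Page ([0-9]+)(?:\s+/Count ([0-9]+))?'


def iter_triples(pattern, text):
    """Hypothetical helper: yield (title, page, count) per finditer() match.

    groups() reports a missing /Count as None where findall() gives '',
    but both are falsy, so "if count:" behaves the same.
    """
    for match in re.finditer(pattern, text, re.DOTALL):
        yield match.groups()


def process(triples, limit=None, indent=0):
    """Return the formatted TOC lines for one section and its subsections."""
    lines = []
    for index, (title, page, count) in enumerate(triples, 1):
        title = indent * 4 * ' ' + title
        lines.append(title.ljust(79, ".") + page.zfill(2))
        if count:
            # The nested call keeps consuming the same generator.
            lines.extend(process(triples, limit=int(count), indent=indent + 1))
        if limit is not None and limit == index:
            break
    return lines


toc_lines = process(iter_triples(pattern, toc))
print("\n".join(toc_lines))
```

Because iter_triples() is a generator, it is already a one-shot iterator, so no extra iter() call is needed.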


Thanks again Peter! Very insightful!

Albert-Jan
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] need help generating table of contents

2018-08-27 Thread Peter Otten
Albert-Jan Roskam wrote:

> 
> From: Tutor  on behalf
> of Peter Otten <__pete...@web.de> Sent: Friday, August 24, 2018 3:55 PM
> To: tutor@python.org
> 
>> The following reshuffle of your code seems to work:
>> 
>> print('\r\n** Table of contents\r\n')
>> pattern = '/Title \((.+?)\).+?/Page ([0-9]+)(?:\s+/Count ([0-9]+))?'
>> 
>> def process(triples, limit=None, indent=0):
>>     for index, (title, page, count) in enumerate(triples, 1):
>>         title = indent * 4 * ' ' + title
>>         print(title.ljust(79, ".") + page.zfill(2))
>>         if count:
>>             process(triples, limit=int(count), indent=indent+1)
>>         if limit is not None and limit == index:
>>             break
>> 
>> process(iter(re.findall(pattern, toc, re.DOTALL)))
> 
> Hi Peter, Cameron,
> 
> Thanks for your replies! The code above indeed works as intended, but: I
> don't really understand *why*. I would assign a name to the following line
> "if limit is not None and limit == index", what would be the most
> descriptive name? I often use "is_*" names for boolean variables. Would
> "is_deepest_nesting_level" be a good name?

No, it's not necessarily the deepest level. Every subsection eventually ends 
at this point; so you might call it

reached_end_of_current_section

Or just 'limit' ;) 

The None is only there for the outermost level where no /Count is provided. 
In this case the loop is exhausted.

If you find it is easier to understand you can calculate the outer count aka 
limit as the number of matches - sum of counts:

def process(triples, section_length, indent=0):
    for index, (title, page, count) in enumerate(triples, 1):
        title = indent * 4 * ' ' + title
        print(title.ljust(79, ".") + page.zfill(2))
        if count:
            process(triples, section_length=int(count), indent=indent+1)
        if section_length == index:
            break

triples = re.findall(pattern, toc, re.DOTALL)
toplevel_section_length = (
    len(triples)
    - sum(int(c or 0) for t, p, c in triples)
)
process(iter(triples), toplevel_section_length)
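
As a rough check of that arithmetic, here is a small sketch using the triples that findall() should return for the toc posted in this thread (typed out by hand here, so treat the list as an assumption). It relies on each /Count giving the number of direct children only, which is how the sample file appears to use it.

```python
# (title, page, count) tuples in the shape re.findall() returns;
# a missing /Count shows up as an empty string.
triples = [
    ("Title page", "1", ""),
    ("Document information", "2", ""),
    ("Blah", "3", ""),
    ("Appendix", "16", "4"),
    ("Sub1", "17", "4"),
    ("Subsub1", "17", ""),
    ("Subsub2", "18", ""),
    ("Subsub3", "29", ""),
    ("Subsub4", "37", ""),
    ("Sub2", "40", ""),
    ("Sub3", "49", ""),
    ("Sub4", "56", ""),
]

# Every entry that is somebody's child is counted exactly once in a /Count.
nested = sum(int(count or 0) for title, page, count in triples)
toplevel_section_length = len(triples) - nested
print(len(triples), nested, toplevel_section_length)  # 12 8 4
```

The four remaining entries are Title page, Document information, Blah and Appendix, i.e. the top-level bookmarks.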

Just for fun here's one last variant that does away with the break -- and 
thus the naming issue -- completely:

def process(triples, limit=None, indent=0):
    for title, page, count in itertools.islice(triples, limit):
        title = indent * 4 * ' ' + title
        print(title.ljust(79, ".") + page.zfill(2))
        if count:
            process(triples, limit=int(count), indent=indent+1)

Note that islice(items, None) does the right thing:

>>> list(islice("abc", None))
['a', 'b', 'c']
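
The islice() variant depends on every slice drawing from the same one-shot iterator, so each nested call picks up where the previous one stopped. A minimal sketch of just that behaviour (the variable names are illustrative, not from the thread):

```python
from itertools import islice

letters = iter("abcdefg")

first = list(islice(letters, 3))    # consumes a, b, c
second = list(islice(letters, 3))   # continues with d, e, f
rest = list(islice(letters, None))  # None means no limit; only g is left

print(first, second, rest)
# ['a', 'b', 'c'] ['d', 'e', 'f'] ['g']
```

With a plain string or list in place of iter(...), every islice() call would start again from the beginning.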


> Also, I don't understand why iter() is required here, and why finditer()
> is not an alternative.

finditer() would actually work -- I didn't use it because I wanted to make 
as few changes as possible to your code. What does not work is a list like 
the result of findall(). This is because the inner for loops (i. e. the ones 
in the nested calls of process) are supposed to continue the iteration 
instead of restarting it. A simple example to illustrate the difference:

>>> s = "abcdefg"
>>> for k in range(3):
...     print("===", k, "===")
...     for i, v in enumerate(s):
...         print(v)
...         if i == 2: break
...
=== 0 ===
a
b
c
=== 1 ===
a
b
c
=== 2 ===
a
b
c
>>> s = iter("abcdefg")
>>> for k in range(3):
...     print("===", k, "===")
...     for i, v in enumerate(s):
...         print(v)
...         if i == 2: break
...
=== 0 ===
a
b
c
=== 1 ===
d
e
f
=== 2 ===
g






Re: [Tutor] need help generating table of contents

2018-08-27 Thread Albert-Jan Roskam


From: Tutor  on behalf of 
Peter Otten <__pete...@web.de>
Sent: Friday, August 24, 2018 3:55 PM
To: tutor@python.org

> The following reshuffle of your code seems to work:
> 
> print('\r\n** Table of contents\r\n')
> pattern = '/Title \((.+?)\).+?/Page ([0-9]+)(?:\s+/Count ([0-9]+))?'
> 
> def process(triples, limit=None, indent=0):
>     for index, (title, page, count) in enumerate(triples, 1):
>         title = indent * 4 * ' ' + title
>         print(title.ljust(79, ".") + page.zfill(2))
>         if count:
>             process(triples, limit=int(count), indent=indent+1)
>         if limit is not None and limit == index:
>             break
> 
> process(iter(re.findall(pattern, toc, re.DOTALL)))

Hi Peter, Cameron,

Thanks for your replies! The code above indeed works as intended, but: I don't 
really understand *why*.
I would assign a name to the following line "if limit is not None and limit == 
index", what would be the most descriptive name? I often use "is_*" names for 
boolean variables. Would "is_deepest_nesting_level" be a good name?

Also, I don't understand why iter() is required here, and why finditer() is not 
an alternative.

I wrote the bookmarks file myself, and the code above is part of a shell script 
that compiles a large .pdf, with openoffice commandline calls, ghostscript, 
git, pdftk and python. The human-readable toc and the pdf bookmarks will always 
be consistent if I only need to edit one file.

Thanks again!

Albert-Jan


Re: [Tutor] need help generating table of contents

2018-08-25 Thread Cameron Simpson

On 24Aug2018 17:55, Peter Otten <__pete...@web.de> wrote:
> Albert-Jan Roskam wrote:
>> I have Ghostscript files with a table of contents (toc) and I would like
>> to use this info to generate a human-readable toc. The problem is: I can't
>> get the (nested) hierarchy right.
>>
>> import re
>>
>> toc = """\
>> [ /PageMode /UseOutlines
>>   /Page 1
>>   /View [/XYZ null null 0]
>>   /DOCVIEW pdfmark
>> [ /Title (Title page)
>>   /Page 1
>>   /View [/XYZ null null 0]
>>   /OUT pdfmark
>> [ /Title (Document information)
>>   /Page 2
>>   /View [/XYZ null null 0]
>>   /OUT pdfmark
>>
>> [...]
>>
>> What is the best approach to do this?
>
> The best approach is probably to use some tool/library that understands
> postscript.


Just to this: I disagree. IIRC, there's no such thing as '/Title' etc in 
PostScript - these will all be PostScript functions defined by whatever made 
the document.  So a generic tool won't have any way to extract semantics like 
titles from a document.


The OP presumably has the specific output of a particular tool with this nice 
well structured postscript, so he needs to write his/her own special parser.


Cheers,
Cameron Simpson 


Re: [Tutor] need help generating table of contents

2018-08-24 Thread Peter Otten
Albert-Jan Roskam wrote:

> Hello,
> 
> I have Ghostscript files with a table of contents (toc) and I would like 
> to use this info to generate a human-readable toc. The problem is: I can't 
> get the (nested) hierarchy right.
> 
> import re
> 
> toc = """\
> [ /PageMode /UseOutlines
>   /Page 1
>   /View [/XYZ null null 0]
>   /DOCVIEW pdfmark
> [ /Title (Title page)
>   /Page 1
>   /View [/XYZ null null 0]
>   /OUT pdfmark
> [ /Title (Document information)
>   /Page 2
>   /View [/XYZ null null 0]
>   /OUT pdfmark
> [ /Title (Blah)
>   /Page 3
>   /View [/XYZ null null 0]
>   /OUT pdfmark
> [ /Title (Appendix)
>   /Page 16
>   /Count 4
>   /View [/XYZ null null 0]
>   /OUT pdfmark
> [ /Title (Sub1)
>   /Page 17
>   /Count 4
>   /OUT pdfmark
> [ /Title (Subsub1)
>   /Page 17
>   /OUT pdfmark
> [ /Title (Subsub2)
>   /Page 18
>   /OUT pdfmark
> [ /Title (Subsub3)
>   /Page 29
>   /OUT pdfmark
> [ /Title (Subsub4)
>   /Page 37
>   /OUT pdfmark
> [ /Title (Sub2)
>   /Page 40
>   /OUT pdfmark
> [ /Title (Sub3)
>   /Page 49
>   /OUT pdfmark
> [ /Title (Sub4)
>   /Page 56
>   /OUT pdfmark
> """
> print('\r\n** Table of contents\r\n')
> pattern = '/Title \((.+?)\).+?/Page ([0-9]+)(?:\s+/Count ([0-9]+))?'
> indent = 0
> start = True
> for title, page, count in re.findall(pattern, toc, re.DOTALL):
>     title = (indent * ' ') + title
>     count = int(count or 0)
>     print(title.ljust(79, ".") + page.zfill(2))
>     if count:
>         count -= 1
>         start = True
>     if count and start:
>         indent += 2
>         start = False
>     if not count and not start:
>         indent -= 2
>         start = True
> 
> This generates the following TOC, with subsub2 to subsub4 dedented one 
> level too much:

> What is the best approach to do this?
 
The best approach is probably to use some tool/library that understands 
postscript. However, your immediate problem is that when there is more than 
one level of indentation you only keep track of the "count" of the innermost 
level. You can either use a list of counts or use recursion and rely on the 
stack to remember the counts of the outer levels.

The following reshuffle of your code seems to work:

print('\r\n** Table of contents\r\n')
pattern = '/Title \((.+?)\).+?/Page ([0-9]+)(?:\s+/Count ([0-9]+))?'

def process(triples, limit=None, indent=0):
    for index, (title, page, count) in enumerate(triples, 1):
        title = indent * 4 * ' ' + title
        print(title.ljust(79, ".") + page.zfill(2))
        if count:
            process(triples, limit=int(count), indent=indent+1)
        if limit is not None and limit == index:
            break

process(iter(re.findall(pattern, toc, re.DOTALL)))
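
For comparison, the other option mentioned above (a list of counts instead of recursion) might look roughly like the sketch below. The function name format_toc, the width parameter and the hand-typed sample are illustrative assumptions, not code from the thread; the function returns the lines instead of printing them.

```python
def format_toc(triples, width=79):
    """Format (title, page, count) triples using an explicit list of counts."""
    remaining = []  # entries still expected per open section, innermost last
    lines = []
    for title, page, count in triples:
        depth = len(remaining)
        lines.append((depth * 4 * ' ' + title).ljust(width, ".") + page.zfill(2))
        if remaining:
            remaining[-1] -= 1            # this entry uses up one slot of its parent
        if count:
            remaining.append(int(count))  # open a new subsection
        while remaining and remaining[-1] == 0:
            remaining.pop()               # close every section that is now complete
    return lines


# Triples in the shape re.findall() returns (a missing /Count is '').
sample = [
    ("Blah", "3", ""),
    ("Appendix", "16", "4"),
    ("Sub1", "17", "4"),
    ("Subsub1", "17", ""),
    ("Subsub2", "18", ""),
    ("Subsub3", "29", ""),
    ("Subsub4", "37", ""),
    ("Sub2", "40", ""),
    ("Sub3", "49", ""),
    ("Sub4", "56", ""),
]
toc_lines = format_toc(sample)
print("\n".join(toc_lines))
```

Since there is only one loop, this version works on the plain list from findall() and does not need iter().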





[Tutor] need help generating table of contents

2018-08-24 Thread Albert-Jan Roskam
Hello,

I have Ghostscript files with a table of contents (toc) and I would like to use 
this info to generate a human-readable toc. The problem is: I can't get the 
(nested) hierarchy right.

import re

toc = """\
[ /PageMode /UseOutlines
  /Page 1
  /View [/XYZ null null 0]
  /DOCVIEW pdfmark
[ /Title (Title page)
  /Page 1
  /View [/XYZ null null 0]
  /OUT pdfmark
[ /Title (Document information)
  /Page 2
  /View [/XYZ null null 0]
  /OUT pdfmark
[ /Title (Blah)
  /Page 3
  /View [/XYZ null null 0]
  /OUT pdfmark
[ /Title (Appendix)
  /Page 16
  /Count 4
  /View [/XYZ null null 0]
  /OUT pdfmark
    [ /Title (Sub1)
  /Page 17
  /Count 4
  /OUT pdfmark
    [ /Title (Subsub1)
  /Page 17
  /OUT pdfmark
    [ /Title (Subsub2)
  /Page 18
  /OUT pdfmark
    [ /Title (Subsub3)
  /Page 29
  /OUT pdfmark
    [ /Title (Subsub4)
  /Page 37
  /OUT pdfmark
    [ /Title (Sub2)
  /Page 40
  /OUT pdfmark
    [ /Title (Sub3)
  /Page 49
  /OUT pdfmark
    [ /Title (Sub4)
  /Page 56
  /OUT pdfmark
"""    
print('\r\n** Table of contents\r\n')
pattern = '/Title \((.+?)\).+?/Page ([0-9]+)(?:\s+/Count ([0-9]+))?'
indent = 0
start = True
for title, page, count in re.findall(pattern, toc, re.DOTALL):
    title = (indent * ' ') + title
    count = int(count or 0)
    print(title.ljust(79, ".") + page.zfill(2))
    if count:
        count -= 1
        start = True
    if count and start:
        indent += 2
        start = False
    if not count and not start:
        indent -= 2
        start = True

This generates the following TOC, with subsub2 to subsub4 dedented one level 
too much:


** Table of contents

Title page...01
Document information...02
Blah...03
Appendix...16
  Sub1...17
    Subsub1...17
  Subsub2...18
  Subsub3...29
  Subsub4...37
  Sub2...40
  Sub3...49
  Sub4...56

What is the best approach to do this?

Thanks in advance!

Albert-Jan