[issue12675] tokenize module happily tokenizes code with syntax errors

2011-08-05 Thread Gareth Rees

Gareth Rees g...@garethrees.org added the comment:

Terry: agreed. Does anyone actually use this module? Does anyone know what the 
design goals are for tokenize? If someone can tell me, I'll do my best to make 
it meet them.

Meanwhile, here's another bug. Each character of trailing whitespace is 
tokenized as an ERRORTOKEN.

Python 3.3.0a0 (default:c099ba0a278e, Aug  2 2011, 12:35:03) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on 
darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from tokenize import tokenize,untokenize
>>> from io import BytesIO
>>> list(tokenize(BytesIO('1 '.encode('utf8')).readline))
[TokenInfo(type=57 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), 
line=''), TokenInfo(type=2 (NUMBER), string='1', start=(1, 0), end=(1, 1), 
line='1 '), TokenInfo(type=54 (ERRORTOKEN), string=' ', start=(1, 1), end=(1, 
2), line='1 '), TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 
0), line='')]
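
For reference, here's the throwaway helper I'm using to spot these (nothing 
clever, it just filters the token stream for ERRORTOKEN; the function name is 
mine):

from io import BytesIO
from tokenize import tokenize
from token import ERRORTOKEN

def error_tokens(source):
    "Return the ERRORTOKEN entries produced for the string 'source'."
    return [tok for tok in tokenize(BytesIO(source.encode('utf-8')).readline)
            if tok.type == ERRORTOKEN]

print(error_tokens('1 '))  # each trailing space comes back as an ERRORTOKEN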

--


[issue12691] tokenize.untokenize is broken

2011-08-05 Thread Gareth Rees

Gareth Rees g...@garethrees.org added the comment:

Please find attached a patch containing four bug fixes for untokenize():

* untokenize() now always returns a bytes object, defaulting to UTF-8 if no 
ENCODING token is found (previously it returned a string in this case).
* In compatibility mode, untokenize() successfully processes all tokens from an 
iterator (previously it discarded the first token).
* In full mode, untokenize() now returns successfully (previously it asserted).
* In full mode, untokenize() successfully processes tokens that were separated 
by a backslashed newline in the original source (previously it ran these tokens 
together).

In addition, I've added some unit tests:

* Test case for backslashed newline.
* Test case for missing ENCODING token.
* roundtrip() tests both modes of untokenize() (previously it just tested 
compatibility mode).

and updated the documentation:

* Update the docstring for untokenize to better describe its actual behaviour, 
and remove the false claim "Untokenized source will match input source 
exactly". (We can restore this claim if we ever fix tokenize/untokenize so that 
it's true.)
* Update the documentation for untokenize in tokenize.rst to match the 
docstring.
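
As a sanity check, the kind of round trip the new tests exercise looks roughly 
like this (a sketch assuming the patch is applied; the example source is 
arbitrary):

from io import BytesIO
from tokenize import tokenize, untokenize

source = b"x = 1 + \\\n    2\n"
tokens = list(tokenize(BytesIO(source).readline))
# With the patch, full-mode untokenize() returns a bytes object and no longer
# runs the backslash-continued lines together.
print(untokenize(tokens).decode('utf-8'))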

I welcome review: this is my first proper patch to Python.

--
keywords: +patch
Added file: http://bugs.python.org/file22842/Issue12691.patch


[issue12700] test_faulthandler fails on Mac OS X Lion

2011-08-05 Thread Gareth Rees

New submission from Gareth Rees g...@garethrees.org:

On Mac OS 10.7, test_faulthandler fails. See test output below.

It looks as though the tests may be at fault in expecting to see 
'(?:Segmentation fault|Bus error)' instead of '(?:Segmentation fault|Bus 
error|Illegal instruction)'.

test_disable (__main__.FaultHandlerTests) ... ok
test_dump_traceback (__main__.FaultHandlerTests) ... ok
test_dump_traceback_file (__main__.FaultHandlerTests) ... ok
test_dump_traceback_threads (__main__.FaultHandlerTests) ... ok
test_dump_traceback_threads_file (__main__.FaultHandlerTests) ... ok
test_dump_tracebacks_later (__main__.FaultHandlerTests) ... ok
test_dump_tracebacks_later_cancel (__main__.FaultHandlerTests) ... ok
test_dump_tracebacks_later_file (__main__.FaultHandlerTests) ... ok
test_dump_tracebacks_later_repeat (__main__.FaultHandlerTests) ... ok
test_dump_tracebacks_later_twice (__main__.FaultHandlerTests) ... ok
test_enable_file (__main__.FaultHandlerTests) ... FAIL
test_enable_single_thread (__main__.FaultHandlerTests) ... FAIL
test_fatal_error (__main__.FaultHandlerTests) ... ok
test_gil_released (__main__.FaultHandlerTests) ... FAIL
test_is_enabled (__main__.FaultHandlerTests) ... ok
test_read_null (__main__.FaultHandlerTests) ... FAIL
test_register (__main__.FaultHandlerTests) ... ok
test_register_chain (__main__.FaultHandlerTests) ... ok
test_register_file (__main__.FaultHandlerTests) ... ok
test_register_threads (__main__.FaultHandlerTests) ... ok
test_sigabrt (__main__.FaultHandlerTests) ... ok
test_sigbus (__main__.FaultHandlerTests) ... ok
test_sigfpe (__main__.FaultHandlerTests) ... ok
test_sigill (__main__.FaultHandlerTests) ... ok
test_sigsegv (__main__.FaultHandlerTests) ... ok
test_stack_overflow (__main__.FaultHandlerTests) ... ok
test_unregister (__main__.FaultHandlerTests) ... ok

======================================================================
FAIL: test_enable_file (__main__.FaultHandlerTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_faulthandler.py", line 207, in test_enable_file
    filename=filename)
  File "test_faulthandler.py", line 105, in check_fatal_error
    self.assertRegex(output, regex)
AssertionError: Regex didn't match: '^Fatal Python error: (?:Segmentation 
fault|Bus error)\n\nCurrent\\ thread\\ XXX:\n  File "<string>", line 4 in 
<module>$' not found in 'Fatal Python error: Illegal instruction\n\nCurrent 
thread XXX:\n  File "<string>", line 4 in <module>'

======================================================================
FAIL: test_enable_single_thread (__main__.FaultHandlerTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_faulthandler.py", line 217, in test_enable_single_thread
    all_threads=False)
  File "test_faulthandler.py", line 105, in check_fatal_error
    self.assertRegex(output, regex)
AssertionError: Regex didn't match: '^Fatal Python error: (?:Segmentation 
fault|Bus error)\n\nTraceback\\ \\(most\\ recent\\ call\\ first\\):\n  File 
"<string>", line 3 in <module>$' not found in 'Fatal Python error: Illegal 
instruction\n\nTraceback (most recent call first):\n  File "<string>", line 3 
in <module>'

======================================================================
FAIL: test_gil_released (__main__.FaultHandlerTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_faulthandler.py", line 195, in test_gil_released
    '(?:Segmentation fault|Bus error)')
  File "test_faulthandler.py", line 105, in check_fatal_error
    self.assertRegex(output, regex)
AssertionError: Regex didn't match: '^Fatal Python error: (?:Segmentation 
fault|Bus error)\n\nCurrent\\ thread\\ XXX:\n  File "<string>", line 3 in 
<module>$' not found in 'Fatal Python error: Illegal instruction\n\nCurrent 
thread XXX:\n  File "<string>", line 3 in <module>'

======================================================================
FAIL: test_read_null (__main__.FaultHandlerTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_faulthandler.py", line 115, in test_read_null
    '(?:Segmentation fault|Bus error)')
  File "test_faulthandler.py", line 105, in check_fatal_error
    self.assertRegex(output, regex)
AssertionError: Regex didn't match: '^Fatal Python error: (?:Segmentation 
fault|Bus error)\n\nCurrent\\ thread\\ XXX:\n  File "<string>", line 3 in 
<module>$' not found in 'Fatal Python error: Illegal instruction\n\nCurrent 
thread XXX:\n  File "<string>", line 3 in <module>'

----------------------------------------------------------------------
Ran 27 tests in 21.711s

FAILED (failures=4)

[issue12691] tokenize.untokenize is broken

2011-08-05 Thread Gareth Rees

Gareth Rees g...@garethrees.org added the comment:

Thanks, Ezio, for the review. I've made all the changes you requested (except 
for the re-ordering of paragraphs in the documentation, which I don't want to 
do because that would lead to the round-trip property being mentioned before 
it's defined). Revised patch attached.

--
Added file: http://bugs.python.org/file22844/Issue12691.patch


[issue12675] tokenize module happily tokenizes code with syntax errors

2011-08-04 Thread Gareth Rees

Gareth Rees g...@garethrees.org added the comment:

I'm having a look to see if I can make tokenize.py better match the real 
tokenizer, but I need some feedback on a couple of design decisions. 

First, how to handle tokenization errors? There are three possibilities:

1. Generate an ERRORTOKEN, resynchronize, and continue to tokenize from after 
the error. This is what tokenize.py currently does in the two cases where it 
detects an error.

2. Generate an ERRORTOKEN and stop tokenizing. This is what tokenizer.c does.

3. Raise an exception (IndentationError, SyntaxError, or TabError). This is 
what the user sees when the parser is invoked from pythonrun.c.

Since the documentation for tokenize.py says, "It is designed to match the 
working of the Python tokenizer exactly", I think that implementing option (2) 
is best here. (This will mean changing the behaviour of tokenize.py in the two 
cases where it currently detects an error, so that it stops tokenizing.)
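
For what it's worth, option (2) amounts to something like the following 
wrapper over the current behaviour (a sketch of the intended behaviour only, 
not the actual change to tokenize.py; the function name is mine):

from token import ERRORTOKEN

def stop_at_first_error(tokens):
    # Option (2): yield tokens up to and including the first ERRORTOKEN,
    # then stop, mirroring what tokenizer.c does.
    for tok in tokens:
        yield tok
        if tok.type == ERRORTOKEN:
            return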

Second, how to record the cause of the error? The real tokenizer records the 
cause of the error in the 'done' field of the 'tok_state' structure, but 
tokenize.py loses this information. I propose to add fields to the TokenInfo 
structure (which is a namedtuple) to record this information. The real 
tokenizer uses numeric constants from errcode.h (E_TOODEEP, E_TABSPACE, 
E_DEDENT etc.), and pythonrun.c converts these to English-language error 
messages (E_TOODEEP: "too many levels of indentation"). Both of these pieces of 
information will be useful, so I propose to add two fields: 'error' (containing 
a string like 'TOODEEP') and 'errormessage' (containing the English-language 
error message).
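
Concretely, the mapping I have in mind is along these lines (the dictionary 
and the spelling of the names are my proposal, not existing API; the messages 
are the ones reported for the corresponding E_* constants):

# Proposed mapping from errcode.h constants (minus the E_ prefix) to the
# English-language messages used when reporting them.
TOKENIZE_ERRORS = {
    'TABSPACE': 'inconsistent use of tabs and spaces in indentation',
    'TOODEEP': 'too many levels of indentation',
    'DEDENT': 'unindent does not match any outer indentation level',
}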

--


[issue12675] tokenize module happily tokenizes code with syntax errors

2011-08-04 Thread Gareth Rees

Gareth Rees g...@garethrees.org added the comment:

Having looked at some of the consumers of the tokenize module, I don't think my 
proposed solutions will work.

It seems to be the case that the resynchronization behaviour of tokenize.py is 
important for consumers that are using it to transform arbitrary Python source 
code (like 2to3.py). These consumers are relying on the roundtrip property 
that X == untokenize(tokenize(X)). So solution (1) is necessary for the 
handling of tokenization errors.

Also, the fact that TokenInfo is a 5-tuple is relied on in some places (e.g. 
lib2to3/patcomp.py line 38), so it can't be extended. And there are consumers 
(though none in the standard library) that are relying on type=ERRORTOKEN being 
the way to detect errors in a tokenization stream. So I can't overload that 
field of the structure.

Any good ideas for how to record the cause of error without breaking backwards 
compatibility?

--


[issue12675] tokenize module happily tokenizes code with syntax errors

2011-08-04 Thread Gareth Rees

Gareth Rees g...@garethrees.org added the comment:

Ah ... TokenInfo is a *subclass* of namedtuple, so I can add extra properties 
to it without breaking consumers that expect it to be a 5-tuple.
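
A minimal sketch of what I mean (the 'error' attribute is the field I proposed 
earlier, not existing API; this works because TokenInfo is a plain subclass 
and so its instances accept extra attributes):

from tokenize import TokenInfo
from token import ERRORTOKEN

tok = TokenInfo(ERRORTOKEN, ' ', (1, 1), (1, 2), '1 ')
tok.error = 'TOODEEP'                     # extra attribute, proposed name
toknum, tokval, start, end, line = tok    # 5-tuple unpacking still works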

--


[issue12691] tokenize.untokenize is broken

2011-08-04 Thread Gareth Rees

New submission from Gareth Rees g...@garethrees.org:

tokenize.untokenize is completely broken.

Python 3.2.1 (default, Jul 19 2011, 00:09:43) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tokenize, io
>>> t = list(tokenize.tokenize(io.BytesIO('1+1'.encode('utf8')).readline))
>>> tokenize.untokenize(t)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File 
"/opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/tokenize.py",
 line 250, in untokenize
    out = ut.untokenize(iterable)
  File 
"/opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/tokenize.py",
 line 179, in untokenize
    self.add_whitespace(start)
  File 
"/opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/tokenize.py",
 line 165, in add_whitespace
    assert row <= self.prev_row
AssertionError

The assertion is simply bogus: the <= should be >=.
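
At minimum the comparison needs flipping, along these lines (a sketch; the 
full patch has to do a little more, e.g. to cope with backslash 
continuations):

def add_whitespace(self, start):
    row, col = start
    assert row >= self.prev_row  # tokens must not start before the previous one
    col_offset = col - self.prev_col
    if col_offset:
        self.tokens.append(" " * col_offset)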

The reason why no-one has spotted this is that the unit tests for the tokenize 
module only ever call untokenize() in compatibility mode, passing in a 
2-tuple instead of a 5-tuple.

I propose to fix this, and add unit tests, at the same time as fixing other 
problems with tokenize.py (issue12675).

--
components: Library (Lib)
messages: 141634
nosy: Gareth.Rees
priority: normal
severity: normal
status: open
title: tokenize.untokenize is broken
type: behavior
versions: Python 3.2, Python 3.3


[issue12691] tokenize.untokenize is broken

2011-08-04 Thread Gareth Rees

Gareth Rees g...@garethrees.org added the comment:

See my last paragraph: I propose to deliver a single patch that fixes both this 
bug and issue12675. I hope this is OK. (If you prefer, I'll try to split the 
patch in two.)

I just noticed another bug in untokenize(): in compatibility mode, if 
untokenize() is passed an iterator rather than a list, then the first token 
gets discarded:

Python 3.3.0a0 (default:c099ba0a278e, Aug  2 2011, 12:35:03) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on 
darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from tokenize import untokenize
>>> from token import *
>>> untokenize([(NAME, 'hello')])
'hello '
>>> untokenize(iter([(NAME, 'hello')]))
''

No-one's noticed this because the unit tests only ever pass lists to 
untokenize().

--


[issue12675] tokenize module happily tokenizes code with syntax errors

2011-08-01 Thread Gareth Rees

New submission from Gareth Rees g...@garethrees.org:

The tokenize module is happy to tokenize Python source code that the real 
tokenizer would reject. Pretty much any instance where tokenizer.c returns 
ERRORTOKEN will illustrate this feature. Here are some examples:

Python 3.3.0a0 (default:2d69900c0820, Aug  1 2011, 13:46:51) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on 
darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from tokenize import generate_tokens
>>> from io import StringIO
>>> def tokens(s):
...     "Return a string showing the tokens in the string s."
...     return '|'.join(t[1] for t in generate_tokens(StringIO(s).readline))
... 
>>> # Bad exponent
>>> print(tokens('1if 2else 3'))
1|if|2|else|3|
>>> 1if 2else 3
  File "<stdin>", line 1
    1if 2else 3
     ^
SyntaxError: invalid token
>>> # Bad hexadecimal constant.
>>> print(tokens('0xfg'))
0xf|g|
>>> 0xfg
  File "<stdin>", line 1
    0xfg
       ^
SyntaxError: invalid syntax
>>> # Missing newline after continuation character.
>>> print(tokens('\\pass'))
\|pass|
>>> \pass
  File "<stdin>", line 1
    \pass
        ^
SyntaxError: unexpected character after line continuation character

It is surprising that the tokenize module does not yield the same tokens as 
Python itself, but as this limitation only affects incorrect Python code, 
perhaps it just needs a mention in the tokenize documentation. Something along 
the lines of: "The tokenize module generates the same tokens as Python's own 
tokenizer if it is given correct Python code. However, it may incorrectly 
tokenize Python code containing syntax errors that the real tokenizer would 
reject."

--
components: Library (Lib)
messages: 141503
nosy: Gareth.Rees
priority: normal
severity: normal
status: open
title: tokenize module happily tokenizes code with syntax errors
type: behavior
versions: Python 3.3


[issue12675] tokenize module happily tokenizes code with syntax errors

2011-08-01 Thread Gareth Rees

Gareth Rees g...@garethrees.org added the comment:

These errors are generated directly by the tokenizer. In tokenizer.c, the 
tokenizer generates ERRORTOKEN when it encounters something it can't tokenize. 
This causes parsetok() in parsetok.c to stop tokenizing and return an error.

--


[issue12514] timeit disables garbage collection if timed code raises an exception

2011-07-07 Thread Gareth Rees

New submission from Gareth Rees g...@garethrees.org:

If you call timeit.timeit and the timed code raises an exception, then garbage 
collection is disabled. I have verified this in Python 2.7 and 3.2. Here's an 
interaction with Python 3.2:

Python 3.2 (r32:88445, Jul  7 2011, 15:52:49) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit, gc
>>> gc.isenabled()
True
>>> timeit.timeit('raise Exception')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File 
"/opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/timeit.py",
 line 228, in timeit
    return Timer(stmt, setup, timer).timeit(number)
  File 
"/opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/timeit.py",
 line 194, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
Exception
>>> gc.isenabled()
False

The problem is with the following code in Lib/timeit.py (lines 192–196):

gcold = gc.isenabled()
gc.disable()
timing = self.inner(it, self.timer)
if gcold:
    gc.enable()

This should be changed to something like this:

gcold = gc.isenabled()
gc.disable()
try:
    timing = self.inner(it, self.timer)
finally:
    if gcold:
        gc.enable()
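
A quick interactive check that the fix behaves as intended (just a sketch):

import gc, timeit

try:
    timeit.timeit('raise Exception')
except Exception:
    pass
assert gc.isenabled()  # with the try/finally in place, gc stays enabled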

--
components: Library (Lib)
messages: 139978
nosy: Gareth.Rees
priority: normal
severity: normal
status: open
title: timeit disables garbage collection if timed code raises an exception
type: behavior
versions: Python 2.7, Python 3.2


[issue12514] timeit disables garbage collection if timed code raises an exception

2011-07-07 Thread Gareth Rees

Gareth Rees g...@garethrees.org added the comment:

Patch attached.

--
keywords: +patch
Added file: http://bugs.python.org/file22605/issue12514.patch
