Łukasz Langa <luk...@langa.pl> added the comment:

> I'm in favor of unifying the tokenizers and of updating and moving pgen2 
> (though I don't have time to do the work).

I'm willing to do all the work as long as I have somebody to review it. Case in 
point: BPO-33338.



> Also I think you may have to make a distinction between the parser generator 
> and its data structures, and between the generated parser for Python vs. the 
> parser for other LL(1) grammars one might feed into it.

Technically pgen2 has the ability to parse any LL(1) grammar but so far the 
plumbing is tightly tied to the tokenizer.  We'd need to enable plugging that 
in, too.



> And I don't think you're proposing to replace Parser/pgen.c with Lib/pgen/, 
> right?

No, I'm not.



> Nor to replace the CST actually used by CPython's parser with the data 
> structures used by pgen2's driver.

No, I'm not.



> So the relationship between the CST you propose to document and CPython 
> internals wouldn't be quite the same as that between the AST used by CPython 
> and the ast module (since those *do* actually use the same code).

Right.  Once we unify the standard library tokenizers (note: *not* tokenizer.c 
which will stay), there wouldn't be much extra documentation to write for 
Lib/tokenize.py.  For Lib/pgen/ itself, we'd need to provide both an API 
rundown and an intro to the high-level functionality (how to create trees from 
files, strings, etc.; how to visit trees and edit them; and so on).


> I'm not sure if it's technically possible to give tokenize.py the ability to 
> tokenize Python 2.7 and up without some version-selection flag -- have you 
> researched this part yet?

There are two schools of thought. This is going to take a while to explain :)

One school is to force the caller to declare what Python version they want to 
parse.  This is algorithmically cleaner because we can then literally take 
Grammar/Grammar from various versions of Python and have the user worry about 
picking the right one.

The other school is what lib2to3 does currently, which is to try to implement 
as much of a superset of Python versions as possible.  This is way easier to 
use because the grammar is very forgiving.  However, this has limitations.  
There are three major incompatibilities that we need to deal with, in 
increasing order of severity:
- async/await;
- print statements;
- exec statements.

Async and await became proper keywords in 3.7 and thus broke usage of those as 
names.  It's relatively easy to work around this one seamlessly by keeping the 
grammar trickery we've had in place for 3.5 and 3.6.  This is what lib2to3 
already does today. 👍🏻

The print statement is fundamentally incompatible with the print function.  
lib2to3 has two grammar variants and most users by default choose the one 
without the print statement.  Why?  Because it cannot be reliably sniffed 
anymore.  Python 3-only code will not use the __future__ import.  In fact, 2to3 
doesn't do auto-detection either; it relies on the user running `2to3 -p` to 
indicate they mean the grammar with the print function.
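
To make the sniffing problem concrete, here is a minimal sketch. The helper 
name `declares_print_function` and the regex are mine, purely for 
illustration; this is not how lib2to3 or 2to3 decide anything.

```python
import re

# Hypothetical heuristic: look for a single-line
# "from __future__ import ... print_function" statement.
# Parenthesized multi-line imports are deliberately not handled.
_FUTURE_PRINT = re.compile(
    r"^\s*from\s+__future__\s+import\s+.*\bprint_function\b",
    re.MULTILINE,
)


def declares_print_function(source: str) -> bool:
    """Return True if the source explicitly opts in to the print function."""
    return _FUTURE_PRINT.search(source) is not None


# Python 2/3 compatible code opts in explicitly, so the sniff works.
compatible = "from __future__ import print_function\nprint('hi')\n"

# Python 3-only code uses the print function without any marker, so the
# sniff says "no" even though the print-statement grammar cannot parse it.
py3_only = "import sys\nprint('msg', file=sys.stderr)\n"

sniffed_compatible = declares_print_function(compatible)
sniffed_py3_only = declares_print_function(py3_only)
```

The second case is the problem: the absence of the import tells us nothing 
about which grammar the file needs.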

The exec statement is even worse because there isn't even a __future__ import.  
It's annoying because it creates a third combination. 👎🏻

So now the driver has to attempt three grammars (in this order):
- the almost compatible combined Python 2 + Python 3 one (that assumes exec is 
a function and print is a function);
- the one that assumes exec is a *statement* but print is still a function 
(because __future__ import);
- the one that exposes the legacy exec and print statements.

This approach has one annoying wart.  Imagine you have a file like this:

  print('msg', file=sys.stderr)
  if

Now the driver will attempt all three grammars and fail, and will report that 
the parse error is on the print line.  This can be overcome by comparing syntax 
errors from each grammar and showing the one on the furthest line (which is the 
most likely to be the real culprit).  But it's still annoying and will 
sometimes not do what the user wanted.
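
A rough sketch of that error-selection idea follows.  Everything in it is 
hypothetical: `ParseError`, `parse_with_fallback` and `failing_parser` are 
names I made up for illustration, and the stand-in parsers only simulate where 
each grammar would give up.  It is not the lib2to3 driver.

```python
from typing import Callable, Sequence


class ParseError(Exception):
    """Hypothetical error type carrying the position where parsing stopped."""

    def __init__(self, msg: str, lineno: int, column: int) -> None:
        super().__init__(f"{msg} (line {lineno}, column {column})")
        self.lineno = lineno
        self.column = column


def parse_with_fallback(
    source: str, parsers: Sequence[Callable[[str], object]]
) -> object:
    """Try each parser in order; if all fail, raise the furthest error."""
    errors = []
    for parse in parsers:
        try:
            return parse(source)
        except ParseError as exc:
            errors.append(exc)
    if not errors:
        raise ValueError("at least one parser is required")
    # The grammar that got furthest into the file is the most likely match
    # for what the author meant, so its error is the one worth reporting.
    raise max(errors, key=lambda exc: (exc.lineno, exc.column))


def failing_parser(lineno: int, column: int) -> Callable[[str], object]:
    """Build a stand-in parser that always fails at a fixed position."""
    def parse(source: str) -> object:
        raise ParseError("bad input", lineno, column)
    return parse
```

For the file above, the two grammars that treat print as a function would 
stop on line 2 and the legacy one on line 1, so feeding stand-ins such as 
`[failing_parser(2, 2), failing_parser(2, 2), failing_parser(1, 5)]` to 
`parse_with_fallback` surfaces a line 2 error.  The column numbers here are 
placeholders, not real parser output.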


-- OK, OK. So which to choose?

And now, while this sounds like more work and is harder to get right, I still 
think the combined grammar with minimal incompatibilities is the better 
approach.  Why?  Two reasons.

1. Nobody ever knows *exactly* what Python version a given file targets.  Most 
files don't even consider compatibility at that level of granularity.  And 
having to attempt to parse with not three but potentially 8 grammars (3.7 - 
3.2, 2.7, 2.6) would be prohibitively slow.

2. My tool may actually want to *modify* the compatibility level by, say, 
rewriting ''.format() with f-strings or putting trailing commas where old 
Pythons didn't accept them.  So it would be awkward if the grammar I used to 
read the file wasn't compatible with my later changes.

Unless I'm swayed otherwise, I'd continue with what lib2to3 does, with the 
exception that we need to add a grammar variant without the `exec` statement, 
and the driver needs to attempt parsing with the three grammars on its own, 
with proper syntax error reporting.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue33337>
_______________________________________