On 01/02/2014 03:21 AM, Steven D'Aprano wrote:
On Wed, Jan 01, 2014 at 02:49:17PM +0100, spir wrote:
On 01/01/2014 01:26 AM, Steven D'Aprano wrote:
On Tue, Dec 31, 2013 at 03:35:55PM +0100, spir wrote:
[...]
I take the opportunity to add a few features, but would do
without Source altogether if it were not for 'i'.
The reason is: it is for a parsing library, or hand-made parsers. Every
matching func, representing a pattern (or "rule"), advances in the source
whenever the match is ok, right? Thus, in addition to returning the form (of
what was matched), they must return the new match index:
        return (form, i)
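
A minimal sketch of that convention, for illustration only. The names match_digits and match_pair are hypothetical, not taken from any library; each function takes the source string and an index and returns either None or the pair (form, new index):

```python
def match_digits(src, i):
    """Match one or more digits starting at index i."""
    j = i
    while j < len(src) and src[j].isdigit():
        j += 1
    if j == i:
        return None              # no match: nothing consumed
    return (src[i:j], j)         # the form plus the new match index


def match_pair(src, i):
    """Match digits, a comma, digits, threading the index through each step."""
    first = match_digits(src, i)
    if first is None:
        return None
    a, i = first
    if i >= len(src) or src[i] != ",":
        return None
    second = match_digits(src, i + 1)
    if second is None:
        return None
    b, i = second
    return ((a, b), i)
```

Every composite function has to unpack and re-pack the index by hand, which is the bookkeeping complained about here.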

The usual way to do this is to make the matching index an attribute of
the parser, not the text being parsed. In OOP code, you make the parser
an object:

class Parser:
     def __init__(self, source):
         self.current_position = 0  # Always start at the beginning
         self.source = source
     def parse(self):
         ...

parser = Parser("some text to be parsed")
for token in parser.parse():
     handle(token)

The index is not an attribute of the source text, because the source
text doesn't care about the index. Only the parser cares about the
index, so it should be the responsibility of the parser to manage.

There is no (need for a) distinct Parser class, or even a notion of parser. A parser is a top-level pattern (rule, object, or match func if one designs more like FP than OOP). Symmetrically, every pattern is a parser for what it matches.

Think of branches in a tree: the tree is a top-level branch and every branch is a sub-tree.

This conception is not semantically meaningful but highly useful, if not necessary, in practice: it permits using every pattern on its own, for what it matches; it permits trying and testing patterns individually. (One could always find and implement workarounds, but they would be needless complications.)
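
To illustrate the "every pattern is a parser" idea, here is a small sketch under assumed names (Literal and Sequence are hypothetical classes, not part of the library under discussion). The top-level pattern and its sub-patterns share the same match interface, so each can be tried on its own:

```python
class Literal:
    """Hypothetical pattern matching a fixed string."""
    def __init__(self, text):
        self.text = text

    def match(self, src, i=0):
        if src.startswith(self.text, i):
            return (self.text, i + len(self.text))
        return None


class Sequence:
    """Hypothetical pattern matching its sub-patterns one after another."""
    def __init__(self, *patterns):
        self.patterns = patterns

    def match(self, src, i=0):
        forms = []
        for pattern in self.patterns:
            result = pattern.match(src, i)
            if result is None:
                return None
            form, i = result
            forms.append(form)
        return (forms, i)


hello = Literal("hello")                                   # usable alone
greeting = Sequence(hello, Literal(" "), Literal("world")) # the "parser"
```

Here greeting plays the role of the parser, yet hello can be tested separately with the very same call.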

However, I see and partially share what you say -- see below.

Symmetrically, every match func using another (meaning nearly all) receives
this pair. (Less annoyingly, every match func also takes i as input, in
addition to the src str.) (There are also a handful of other annoying
points, consequences of those ones.)

The match functions are a property of the parser, not the source text.
So they should be methods on a Parser object. Since they need to track
the index (or indexes), the index ought to be an attribute on the
parser, not the source text.

This does not hold for me. Think e.g. of 2-phase parsing (like when using lex & yacc): the lexer (lex) provides the parser (yacc) with a stream of lexemes, completely opaquely for the parser, which does not know about indexes (neither in the original source string, nor in the stream of lexemes). Since I (usually) parse in a single phase, the source string is in the position of lex above: it feeds the parser with a stream of ucodes, advancing in a coherent manner; the parser does not need, nor want (lol!), to manage the source's index. That's the point. The index is a given for the parser, which it just uses to try & match at the correct position.

If I have a string that stores its index, all of this mess is gone.
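
A rough sketch of what such a string-with-index could look like. Only the names Source and i come from this thread; the methods and the match_literal function are assumptions made for illustration:

```python
class Source:
    """A source text that carries its own current match index."""
    def __init__(self, text):
        self.text = text
        self.i = 0                     # current position in the text

    def startswith(self, prefix):
        # Test at the current position, not at the start of the text.
        return self.text.startswith(prefix, self.i)

    def advance(self, n):
        self.i += n


def match_literal(source, literal):
    """Return the matched form and advance the source, or return None."""
    if source.startswith(literal):
        source.advance(len(literal))
        return literal
    return None
```

With this shape, a match func takes only the source and returns only the form; the index travels with the source.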

What you are describing is covered by Martin Fowler's book
"Refactoring". He describes the problem:

     A field is, or will be, used by another class more than the
     class on which it is defined.

and the solution is to move the field from that class to the class where
it is actually used.

("Refactoring - Ruby Edition", by Jay Fields, Shane Harvie and Martin
Fowler.)

Having a class (in your case, Source) carry around state which is only
used by *other functions* is a code-smell. That means that Source is
responsible for things it has no need of. That's poor design.

I don't share this. It is just like an open file currently being read, which conceptually (if not in practice) has a current index.

It's also sane & simple, and thread-safe, even if two Source objects happened to share (refs to) the same underlying actual source string (e.g. read from the same source file): each has its own current index.
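
The file analogy can be seen with the standard library's io.StringIO, used here purely as an illustration (it is not the Source class discussed in this thread): two reader objects built over the same text each keep their own position.

```python
import io

text = "some text to be parsed"

# Two independent readers over the same underlying string.
reader_a = io.StringIO(text)
reader_b = io.StringIO(text)

first_word = reader_a.read(4)   # advances reader_a only
position_a = reader_a.tell()    # reader_a's own position
position_b = reader_b.tell()    # reader_b has not moved
```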

I don't see how Fowler's views apply to this case. Whether a Source or a Parser holds the index does not change attribute access or its desirable properties.

By making the parser a class, instead of a bunch of functions, they can
share state -- the *parser state*. That state includes:

- the text being parsed;
- the tokens that can be found; and
- the position in the text.

The caller can create as many parsers as they need:

parse_this = Parser("some text")
parse_that = Parser("different text")

without them interfering, and then run the parsers independently of each
other. The implementer, that is you, can change the algorithm used by
the Parser without the caller needing to know. With your current design,
you start with this:

# caller is responsible for tracking the index
source = Source("some text")
assert source.i == 0
parse(source)

Maybe we just don't have the same experience or practice of parsing, but your reflection does not match (sic!) anything I know. I don't see how having a source hold its index prevents anything above. In particular, how does it prevent me from being able to "change the algorithm used by the Parser without the caller needing to know"?

What happens if next month you decide to change the parsing algorithm?
Now it needs not one index, but two.

?
what do you mean?

 You change the parse() function,
but the caller's code breaks because Source only has one index.

?
Every one of us constantly changes the algorithm during the development phase, and when setting up patterns for a new kind of source, don't we? What does that have to do with where the match index is stored?

 You
can't change Source, because other parts of the code are relying on
Source having exactly a single index.

?

 So you have to introduce *two* new
pieces of code, and the caller has to make two changes::

source = SourceWithTwoIndexes("some text")
assert source.i == 0 and source.j == -1
improved_parse(source)

???

Instead, if the parser is responsible for tracking its own data (the
index, or indexes), then the caller doesn't need to care if the parsing
algorithm changes. The internal details of the parser are irrelevant to
the caller. This is a good thing!

A source is just a source, not part of, container of, or in any way related to the parsing algorithm. Changing an algo does not interfere with the source in any manner I can imagine.

parser = Parser("some text")
parser.parse()

With this design, if you change the internal details of the parser, the
caller doesn't need to change a thing. They get the improved parser for
free.

Since the parser tracks both the source text and the index, it doesn't
need to worry that the Source object might change the index.

With your design, the index is part of the source text. That means that
the source text is free to change the index at any time. But it can't do
that, since there might be a parser in the middle of processing it. So
the Source class has to carry around data that it isn't free to use.

This is the opposite of encapsulation. It means that the Source object
and the parsing code are tightly coupled. The Source object has no way
of knowing whether it is being parsed or not, but has to carry around
this dead weight, an unused (unused by Source) field, and avoid using it
for any reason, *just in case* it is being used by a parser. This is the
very opposite of how OOP is supposed to work.

I understand, I guess, the underlying concerns expressed in your views, like separation of concerns and things holding their own data. This is in fact related to why I want sources to know their index. An alternative is to consider the whole library, or parser module (a bunch of matching patterns), as a global tool, a parsing machine, and make i a module-level var. Not a global in the usual sense: it's really part of the parsing machine (part of the machine's state), thus an attribute rather than a plain var. But this prevented being thread-safe (however unlikely it may be to ever parse sources in parallel; I don't know of any example), and it looked somewhat aesthetically unsatisfying to me.

The price is having attribute access (in code and execution time) for every actual read into the source. I dislike this, but less than having a "lost" var roaming around ;-).

It
makes for clean and simple interfaces everywhere. Also (one of the
consequences) I can directly provide match funcs to the user, instead of
having to wrap them inside a func whose only utility is to hide the
additional index (in both input & output).

I don't quite understand what you mean here.

If every match func, or 'match' method of pattern objects, takes & returns the index (in addition to its source input and form output), then it doesn't have the simple interface expected by users of the tool (lib or hand-baked parser, or a mix). You want to write this:
        form = pat.match(source)
Not that:
        form, i = pat.match(source, i)

I'd thus need to provide wrapper funcs to get rid of i in both input and output.
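
Such a wrapper might look like the following sketch. The names match_word and hide_index are hypothetical, and the wrapper assumes matching always starts at index 0:

```python
def match_word(src, i):
    """Hypothetical low-level match func: takes and returns the index."""
    j = i
    while j < len(src) and src[j].isalpha():
        j += 1
    if j == i:
        return None
    return (src[i:j], j)


def hide_index(match_func):
    """Wrap a (src, i) -> (form, i) func into a plain src -> form func."""
    def wrapper(src):
        result = match_func(src, 0)   # assumption: always start at 0
        if result is None:
            return None
        form, _ = result              # drop the index from the output
        return form
    return wrapper


word = hide_index(match_word)
```

Every match func exposed to users would need this extra layer, which is the overhead a source carrying its own index avoids.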

Denis