On 2/14/2020 1:48 AM, Nikita Popov wrote:
On Thu, Feb 13, 2020 at 6:06 PM Larry Garfield <la...@garfieldtech.com>
wrote:

On Thu, Feb 13, 2020, at 3:47 AM, Nikita Popov wrote:
Hi internals,

This has been discussed a while ago already, now as a proper proposal:
https://wiki.php.net/rfc/token_as_object

tl;dr is that it allows you to get token_get_all() output as an array of
PhpToken objects. This reduces memory usage, improves performance, makes
code more uniform and readable... What's not to like?
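For readers following along, the non-uniformity the RFC targets is visible in today's token_get_all() output, which mixes three-element arrays with bare strings (a minimal sketch of current behaviour only; the proposed PhpToken class would replace both shapes with one object, and its exact method names may differ from the RFC draft):

```php
<?php
// Current token_get_all() behaviour: compound tokens are 3-element
// arrays [id, text, line]; single-character tokens are bare strings.
// The RFC proposes one uniform object shape instead.
$tokens = token_get_all('<?php echo 1 + 2;');

foreach ($tokens as $token) {
    if (is_array($token)) {
        [$id, $text, $line] = $token;
        echo token_name($id), " => ", var_export($text, true), "\n";
    } else {
        // Punctuation like '+' or ';' arrives as a plain string.
        echo "literal => ", var_export($token, true), "\n";
    }
}
```

Every consumer of the token stream currently needs this is_array() branch; a single object type would remove it.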

An open question is whether (at least to start with) PhpToken should be
just a data container, or whether we want to add some helper methods to
it.
If this generates too much bikeshed, I'll drop methods from the proposal.

Regards,
Nikita

I love everything about this.

1) I would agree with Nicolas that a static constructor would be better.
I don't know about polyfilling it, but it's definitely more
self-descriptive.

2) I'm skeptical about the methods.  I can see them being useful, but also
being bikeshed material.  For instance, if you're doing annotation parsing
then docblocks are not ignorable.  They're what you're actually looking for.

Two possible additions, feel free to ignore if they're too complicated:

1) Should it return an array of token objects, or a lazy iterable?  If I'm
only interested in certain types (e.g., doc strings, classes, etc.) then a
lazy iterable would let me chain filter and map operations onto it and use
even less memory overall, since the whole token stream is never in memory
at once.
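The consumption style being asked about would look roughly like this (purely illustrative: token_get_all() still builds the full array internally here, so wrapping it in a generator only sketches the pipeline shape, not the memory savings a真 lazy tokenizer would bring):

```php
<?php
// Hypothetical sketch: expose tokens as a lazy stream so callers can
// filter for just the token types they care about.
function tokens(string $code): Generator {
    foreach (token_get_all($code) as $t) {
        yield $t;
    }
}

// Example consumer: collect only doc comments (e.g. for annotation parsing).
$docblocks = [];
foreach (tokens('<?php /** @api */ function f() {}') as $t) {
    if (is_array($t) && $t[0] === T_DOC_COMMENT) {
        $docblocks[] = $t[1];
    }
}
```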


I'm going to take you up on your offer and ignore this one :P Returning
tokens as an iterator is inefficient because it requires full lexer state
backups and restores for each token. Could be optimized, but I wouldn't
bother with it for this feature. I also personally have no use-case for a
lazy token stream. (It's technically sufficient for parsing, but if you
want to preserve formatting, you're going to be preserving all the tokens
anyway.)

Try passing a 10MB PHP file that's all code into token_get_all(). It's pretty easy to hit hard memory limits, or even crash PHP, when the whole file is tokenized into one giant array or set of objects. Calling gc_mem_caches() once the earlier allocations are no longer needed helps somewhat. Stream-based token parsing would be better for RAM usage, but I can see how it might be complex to implement and largely not worth it: such scenarios are rare, it requires the ability to maintain lexer state externally as you mentioned, and it would only be used by this one part of the software.
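To put rough numbers on the all-at-once cost, here is a small experiment (illustrative only; actual figures depend on the PHP version and memory manager, and the source size here is kept modest so it runs under a default memory_limit):

```php
<?php
// Generate a chunk of trivial PHP source and tokenize it in one call.
// Every token lives in memory simultaneously, so peak usage scales with
// input size rather than with what the caller actually keeps.
$code = '<?php ' . str_repeat('$x = 1 + 2; ', 10000);

$before = memory_get_usage();
$tokens = token_get_all($code);
$count  = count($tokens);
$peak   = memory_get_peak_usage() - $before;

unset($tokens);
gc_mem_caches(); // reclaim freed Zend memory manager caches, as mentioned above

printf("%d tokens, ~%.1f MB peak for %.0f KB of source\n",
    $count, $peak / 1048576, strlen($code) / 1024);
```

The token array's footprint is many times the size of the source itself, which is why a 10MB input can blow past a typical memory_limit.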

--
Thomas Hruska
CubicleSoft President

I've got great, time saving software that you will find useful.

http://cubiclesoft.com/

And once you find my software useful:

http://cubiclesoft.com/donate/

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
