On 31-Dec-2013 05:51, Brad Anderson wrote:
On Sunday, 29 December 2013 at 22:02:57 UTC, Dmitry Olshansky wrote:
Proposal

Having never written a parser I'm not really qualified to seriously
comment on or review it, but it all looks very nice to me.

Speaking as just an end user of these things, whenever I use ranges over
files or from, say, std.net.curl, the byLine/byChunk interface always
feels terribly awkward, which often leads to me just giving up and
loading the entire file/resource into an array. It's the boundaries that
I stumble over: byLine never fits when I want to extract something
multiline, and byChunk doesn't fit because if what I'm searching for
lands on a chunk boundary I'll miss it.
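
A minimal sketch of that stumbling block, assuming a local page.html and
a <title> pattern (both invented for illustration):

import std.regex, std.stdio;

void main()
{
    // "s" lets '.' match newlines, since the capture can span lines.
    auto re = regex(`<title>(.*?)</title>`, "s");

    // byChunk: each chunk is searched in isolation, so a title that
    // straddles a chunk boundary is silently missed.
    foreach (chunk; File("page.html").byChunk(4096))
    {
        auto m = matchFirst(cast(char[]) chunk, re);
        if (!m.empty)
            writeln(m[1]);
    }
    // byLine is no better here: the match may span several lines.
}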

Exactly, the situation is simply not good enough. I can assure you that on the parser writers' side it's even less appealing.


Being able to just do a matchAll() on a file, std.net.curl, etc. without
sacrificing performance and memory would be such a massive gain for
usability.

... and performance ;)
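
For plain files there is at least a stopgap today: memory-map the file
so matchAll can scan it without copying it all into an array (file name
and pattern invented for illustration). It doesn't help with
std.net.curl, though:

import std.mmfile, std.regex, std.stdio;

void main()
{
    // The OS pages the file in on demand; nothing is read up front.
    auto mmf = new MmFile("huge.log");
    auto data = cast(const(char)[]) mmf[];
    foreach (m; matchAll(data, regex(`ERROR: (.*)`)))
        writeln(m[1]);
}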


Here's a simple example where I couldn't figure out how to use either
byLine or byChunk without adding some clunky homegrown buffering
solution. It's code that scrapes website titles from the pages of
URLs in IRC messages.
[snip]
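
(A rough stand-in for the snipped code, with invented names, just to
show the shape of the get()-based UFCS chain under discussion:)

import std.algorithm : filter, map;
import std.array : array;
import std.net.curl : get;
import std.regex;

// Extract URLs from an IRC message, fetch each page whole with get(),
// and pull the <title> out with a regex whose capture may span lines.
string[] scrapeTitles(string message)
{
    auto urlRe   = regex(`https?://\S+`);
    auto titleRe = regex(`<title>\s*(.*?)\s*</title>`, "s");
    return message.matchAll(urlRe)
                  .map!(u => matchFirst(get(u.hit).idup, titleRe))
                  .filter!(m => !m.empty)
                  .map!(m => m[1])
                  .array;
}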

I really, really didn't want to use that std.net.curl.get().  It causes
all sorts of problems if someone links to a huge resource.

*Nods*

I just could
not figure out how to use byLine (the title regex capture can be
multiline) or byChunk cleanly. Code elegance (a lot of it due to Jakob
Ovrum's help in IRC) was really a goal here since this is just a toy, so
I went with get() for the time being, but it's always sad to trade
performance away for elegance. I certainly didn't want to add some
elaborate ever-growing buffer in the middle of this otherwise clean UFCS
chain (and I'm not even sure how to incrementally regex-search the
growing buffer, or if that's even possible); a sketch of what that would
look like follows.
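
For the record, a sketch of that ever-growing-buffer approach (URL and
pattern invented). Note it re-scans the whole buffer on every chunk and
the buffer never shrinks:

import std.net.curl : byChunkAsync;
import std.regex, std.stdio;

void main()
{
    auto re = regex(`<title>\s*(.*?)\s*</title>`, "s");
    char[] buf;                               // grows without bound
    // byChunkAsync streams the download on a worker thread.
    foreach (chunk; byChunkAsync("http://dlang.org", 4096))
    {
        buf ~= cast(char[]) chunk;            // keep everything seen so far
        auto m = matchFirst(buf, re);         // re-scan from the start
        if (!m.empty)
        {
            writeln(m[1]);
            break;                            // found it; stop downloading
        }
    }
}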

I did consider providing something like that: an incremental match that takes the data slice by slice, together with some kind of "not yet matched" state object carried between slices. But that was solving the wrong problem, and it shows that backtracking engines simply can't work that way: on a failed attempt, a pattern like a.*b may need to revisit positions in slices that have already been discarded.


If I'm understanding your proposal correctly, that get(url) could be
replaced with a hypothetical std.net.curl "buffer range", which could
then be passed directly to matchFirst. It would only take up, at most,
the size of the buffer in memory (which could grow if a capture grows
larger than the buffer) and wouldn't read the unneeded portion of the
resource at all. That would be such a huge win for everyone, so I'm very
excited about this proposal. It addresses all of my current problems.

That's indeed what the proposal is all about. Glad it makes sense :)
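
To make the intent concrete, here's roughly what such code could look
like. curlBuffer is an invented name, and this placeholder still loads
the whole body; under the proposal the same matchFirst call would
instead run over a bounded buffer window on a live stream:

import std.net.curl : get;
import std.regex, std.stdio;

// Invented stand-in for the proposed std.net.curl "buffer range".
// Today it simply fetches everything so the example compiles.
auto curlBuffer(const(char)[] url)
{
    return get(url);
}

void main()
{
    auto re = regex(`<title>\s*(.*?)\s*</title>`, "s");
    auto m = matchFirst(curlBuffer("http://dlang.org"), re);
    if (!m.empty)
        writeln(m[1]);
}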



P.S. I love std.regex more and more every day. It made that
entitiesToUni function so easy to implement: http://dpaste.dzfl.pl/688f2e7d

Aye, replace with functor rox!
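
For readers who skip the paste, a minimal version of the trick, handling
only numeric character references (the real entitiesToUni presumably
covers named entities as well):

import std.conv : to;
import std.regex, std.stdio;

// replaceAll with a functor: the lambda receives the captures of each
// match and computes the replacement, here decoding &#NNNN; / &#xHHHH;.
string entitiesToUni(string s)
{
    auto re = regex(`&#(x?)([0-9a-fA-F]+);`);
    return s.replaceAll!(m =>
        to!string(cast(dchar) to!uint(m[2], m[1].length ? 16 : 10))
    )(re);
}

void main()
{
    writeln(entitiesToUni("&#x2764; D &#10084;")); // prints: ❤ D ❤
}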

--
Dmitry Olshansky
