Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Serhiy Storchaka

06.10.18 10:22, Ram Rachum wrote:
I'd like to use the re module to parse a long text file, 1GB in size. I 
wish that the re module could parse a stream, so I wouldn't have to load 
the whole thing into memory. I'd like to iterate over matches from the 
stream without keeping the old matches and input in RAM.


How would you match 'a.*b' without loading the whole thing into memory?



Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Ram Rachum
It'll load as much as it needs to in order to match or rule out a match on
a pattern. If you'd try to match `a.*b` it'll load the whole thing. The use
cases that are relevant to a stream wouldn't have these kinds of problems.

On Sat, Oct 6, 2018 at 11:22 AM Serhiy Storchaka 
wrote:

> 06.10.18 10:22, Ram Rachum wrote:
> > I'd like to use the re module to parse a long text file, 1GB in size. I
> > wish that the re module could parse a stream, so I wouldn't have to load
> > the whole thing into memory. I'd like to iterate over matches from the
> > stream without keeping the old matches and input in RAM.
>
> How would you match 'a.*b' without loading the whole thing into memory?
>


Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Jonathan Fine
Hi Ram

You wrote:

> I'd like to use the re module to parse a long text file, 1GB in size. I
> wish that the re module could parse a stream, so I wouldn't have to load
> the whole thing into memory. I'd like to iterate over matches from the
> stream without keeping the old matches and input in RAM.

This is a regular expression problem, rather than a Python problem. A search for
regular expression large file
brings up some URLs that might help you, starting with
https://stackoverflow.com/questions/23773669/grep-pattern-match-between-very-large-files-is-way-too-slow

This might also be helpful
https://svn.boost.org/trac10/ticket/11776

What will work for your problem depends on the nature of the problem
you have. The simplest thing that might work is to iterate over the file
line-by-line, and use a regular expression to extract matches from
each line.

In other words, something like (not tested)

    def helper(lines):
        for line in lines:
            yield from re.finditer(pattern, line)

    lines = open('my-big-file.txt')
    for match in helper(lines):
        pass  # Do your stuff here

Parsing is not the same as lexing, see
https://en.wikipedia.org/wiki/Lexical_analysis

I suggest you use regular expressions ONLY for the lexing phase. If
you'd like further help, perhaps first ask yourself this. Can the
lexing be done on a line-by-line basis? And if not, why not?

If line-by-line is not possible, then you'll have to modify the helper.
At the end of each line, there'll be a residue / remainder, which
you'll have to bring into the next line. In other words, the helper
will have to record (and change) the state that exists at the end of
each line. A bit like the 'carry' that is used when doing long
addition.
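
To make the 'carry' idea concrete, here is a rough sketch that works on
chunks rather than lines. It assumes the pattern can never match more than
some known number of characters (the helper name and the max_match_len
parameter are made up for illustration), and it ignores anchors such as ^
and $:

    import re

    def finditer_stream(pattern, file, chunk_size=1 << 16, max_match_len=1024):
        # Yield matches from a file read in chunks, carrying a tail of the
        # buffer between reads so that matches can straddle chunk boundaries.
        regex = re.compile(pattern)
        carry = ''
        while True:
            chunk = file.read(chunk_size)
            if not chunk:
                # End of input: the leftover text can be searched completely.
                yield from regex.finditer(carry)
                return
            buf = carry + chunk
            # A match starting before this cutoff cannot grow any further,
            # even if more data arrives, so it is safe to report it now.
            cutoff = len(buf) - max_match_len
            pos = 0
            for m in regex.finditer(buf):
                if m.start() >= cutoff:
                    break
                yield m
                pos = m.end()
            carry = buf[max(pos, cutoff):]

For a binary file you'd open it in 'rb' mode, pass a bytes pattern and start
with carry = b''.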

I hope this helps.

-- 
Jonathan


Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Ram Rachum
"This is a regular expression problem, rather than a Python problem."

Do you have evidence for this assertion, except that other regex
implementations have this limitation? Is there a regex specification
somewhere that specifies that streams aren't supported? Is there a
fundamental reason that streams aren't supported?


"Can the lexing be done on a line-by-line basis?"

For my use case, it unfortunately can't.

On Sat, Oct 6, 2018 at 1:53 PM Jonathan Fine  wrote:

> Hi Ram
>
> You wrote:
>
> > I'd like to use the re module to parse a long text file, 1GB in size. I
> > wish that the re module could parse a stream, so I wouldn't have to load
> > the whole thing into memory. I'd like to iterate over matches from the
> > stream without keeping the old matches and input in RAM.
>
> This is a regular expression problem, rather than a Python problem. A
> search for
> regular expression large file
> brings up some URLs that might help you, starting with
>
> https://stackoverflow.com/questions/23773669/grep-pattern-match-between-very-large-files-is-way-too-slow
>
> This might also be helpful
> https://svn.boost.org/trac10/ticket/11776
>
> What will work for your problem depends on the nature of the problem
> you have. The simplest thing that might work is to iterate of the file
> line-by-line, and use a regular expression to extract matches from
> each line.
>
> In other words, something like (not tested)
>
>def helper(lines):
>for line in lines:
>yield from re.finditer(pattern, line)
>
> lines = open('my-big-file.txt')
> for match in helper(lines):
> # Do your stuff here
>
> Parsing is not the same as lexing, see
> https://en.wikipedia.org/wiki/Lexical_analysis
>
> I suggest you use regular expressions ONLY for the lexing phase. If
> you'd like further help, perhaps first ask yourself this. Can the
> lexing be done on a line-by-line basis? And if not, why not?
>
> If line-by-line not possible, then you'll have to modify the helper.
> At the end of each line, they'll be a residue / remainder, which
> you'll have to bring into the next line. In other words, the helper
> will have to record (and change) the state that exists at the end of
> each line. A bit like the 'carry' that is used when doing long
> addition.
>
> I hope this helps.
>
> --
> Jonathan
>
>


Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Jonathan Fine
I wrote:
> This is a regular expression problem, rather than a Python problem.

Ram wrote:
> Do you have evidence for this assertion, except that
> other regex implementations have this limitation?

Yes.

1. I've already supplied: https://svn.boost.org/trac10/ticket/11776
2. https://en.wikipedia.org/wiki/Regular_expression does not contain
the word 'stream'.

Ram wrote:
> For my use case, it unfortunately can't.

For that case, Ram, I've already given my best advice. If you get a
better idea, please do follow it. And share it with us, so we can all
learn.

-- 
Jonathan


Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Ned Batchelder

On 10/6/18 7:25 AM, Ram Rachum wrote:

"This is a regular expression problem, rather than a Python problem."

Do you have evidence for this assertion, except that other regex 
implementations have this limitation? Is there a regex specification 
somewhere that specifies that streams aren't supported? Is there a 
fundamental reason that streams aren't supported?



"Can the lexing be done on a line-by-line basis?"

For my use case, it unfortunately can't.


You mentioned earlier that your use case doesn't have to worry about the 
"a.*b" problem.  Can you tell us more about your scenario?  How would 
the stream know it had read enough to match or not match? Perhaps that 
same logic can be used to feed the data in chunks?


--Ned.



On Sat, Oct 6, 2018 at 1:53 PM Jonathan Fine wrote:


Hi Ram

You wrote:

> I'd like to use the re module to parse a long text file, 1GB in
size. I
> wish that the re module could parse a stream, so I wouldn't have
to load
> the whole thing into memory. I'd like to iterate over matches
from the
> stream without keeping the old matches and input in RAM.

This is a regular expression problem, rather than a Python
problem. A search for
    regular expression large file
brings up some URLs that might help you, starting with

https://stackoverflow.com/questions/23773669/grep-pattern-match-between-very-large-files-is-way-too-slow

This might also be helpful
https://svn.boost.org/trac10/ticket/11776

What will work for your problem depends on the nature of the problem
you have. The simplest thing that might work is to iterate of the file
line-by-line, and use a regular expression to extract matches from
each line.

In other words, something like (not tested)

   def helper(lines):
       for line in lines:
           yield from re.finditer(pattern, line)

    lines = open('my-big-file.txt')
    for match in helper(lines):
        # Do your stuff here

Parsing is not the same as lexing, see
https://en.wikipedia.org/wiki/Lexical_analysis

I suggest you use regular expressions ONLY for the lexing phase. If
you'd like further help, perhaps first ask yourself this. Can the
lexing be done on a line-by-line basis? And if not, why not?

If line-by-line not possible, then you'll have to modify the helper.
At the end of each line, they'll be a residue / remainder, which
you'll have to bring into the next line. In other words, the helper
will have to record (and change) the state that exists at the end of
each line. A bit like the 'carry' that is used when doing long
addition.

I hope this helps.

-- 
Jonathan








Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Nathaniel Smith
On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum  wrote:
> I'd like to use the re module to parse a long text file, 1GB in size. I wish
> that the re module could parse a stream, so I wouldn't have to load the
> whole thing into memory. I'd like to iterate over matches from the stream
> without keeping the old matches and input in RAM.
>
> What do you think?

This has frustrated me too.

The case where I've encountered this is parsing HTTP/1.1. We have data
coming in incrementally over the network, and we want to find the end
of the headers. To do this, we're looking for the first occurrence of
b"\r\n\r\n" OR b"\n\n".

So our requirements are:

1. Search a bytearray for the regex b"\r\n\r\n|\n\n"
2. If there's no match yet, wait for more data to arrive and try again
3. When more data arrives, start searching again *where the last
search left off*

The last requirement is subtle, but important. The naive approach
would be to rescan your whole receive buffer after each new packet
arrives:

end_of_headers = re.compile(b"\r\n\r\n|\n\n")
while True:
    m = end_of_headers.search(receive_buffer)
    if m is None:
        receive_buffer += await get_more_data_from_network()
        # loop around and try again
    else:
        break

But the code above is quadratic! If the headers are N bytes long, then
on each pass through the loop we perform an O(N) regex search, and we
do O(N) passes through the loop, so the whole thing is O(N**2). That
means your HTTP client-or-server can be trivially DoSed by a peer who
sends their headers broken into lots of small fragments.

Fortunately, there's an elegant and natural solution: Just save the
regex engine's internal state when it hits the end of the string, and
then when more data arrives, use the saved state to pick up the search
where we left off. Theoretically, any regex engine *could* support
this – it's especially obvious for DFA-based matchers, but even
backtrackers like Python's re could support it, basically by making
the matching engine a coroutine that can suspend itself when it hits
the end of the input, then resume it when new input arrives. Like, if
you asked Knuth for the theoretically optimal design for this parser,
I'm pretty sure this is what he'd tell you to use, and it's what
people do when writing high-performance HTTP parsers in C.

But unfortunately, in reality, re *doesn't* support this kind of
pause/resume functionality, and you can't write efficient
character-by-character algorithms in Python, so you have to use really
awkward hacks instead. For the HTTP header case, the best I've been
able to come up with is to manually analyze the regex to figure out
the maximum size string it could match (in this case, 4 bytes), and
then write a loop that tracks how long the string was before the last
time we appended new data, and on each iteration searches the
substring receive_buffer[old_length - 4 + 1:]. This is super finicky,
and especially annoying if you want to offer this as a generic API for
using regexes to deconstruct network streams. (There are a lot of
Python network libraries that have accidentally-quadratic parsers in
them.)
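
For concreteness, a rough sketch of that workaround, reusing the names from
the snippet above (get_more_data_from_network() is still a stand-in, and the
4 comes from the longest string the pattern can match, b"\r\n\r\n"):

    end_of_headers = re.compile(b"\r\n\r\n|\n\n")
    MAX_MATCH_LEN = 4

    receive_buffer = bytearray()
    while True:
        old_length = len(receive_buffer)
        receive_buffer += await get_more_data_from_network()
        # Resume a few bytes before the old end of the buffer so a match
        # straddling the boundary is still found, but never rescan earlier
        # bytes -- keeping the total work O(N) instead of O(N**2).
        search_start = max(0, old_length - MAX_MATCH_LEN + 1)
        m = end_of_headers.search(receive_buffer, search_start)
        if m is not None:
            break

Like the snippet above, this only makes sense inside a coroutine.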

In practice I suspect retrofitting this functionality into 're' would
be a lot of work. But it's definitely frustrating that we have 90% of
the machinery we'd need to do things the natural/efficient way, but
then are thwarted by this arbitrary API limitation.

-n

-- 
Nathaniel J. Smith -- https://vorpus.org


Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Chris Angelico
On Sun, Oct 7, 2018 at 8:01 AM Nathaniel Smith  wrote:
>
> On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum  wrote:
> > I'd like to use the re module to parse a long text file, 1GB in size. I wish
> > that the re module could parse a stream, so I wouldn't have to load the
> > whole thing into memory. I'd like to iterate over matches from the stream
> > without keeping the old matches and input in RAM.
> >
> > What do you think?
>
> This has frustrated me too.
>
> The case where I've encountered this is parsing HTTP/1.1. We have data
> coming in incrementally over the network, and we want to find the end
> of the headers. To do this, we're looking for the first occurrence of
> b"\r\n\r\n" OR b"\n\n".
>
> So our requirements are:
>
> 1. Search a bytearray for the regex b"\r\n\r\n|\n\n"
> 2. If there's no match yet, wait for more data to arrive and try again
> 3. When more data arrives, start searching again *where the last
> search left off*
>
> The last requirement is subtle, but important. The naive approach
> would be to rescan your whole receive buffer after each new packet
> arrives:
>
> end_of_headers = re.compile(b"\r\n\r\n|\n\n")
> while True:
> m = end_of_headers.search(receive_buffer)
> if m is None:
> receive_buffer += await get_more_data_from_network()
> # loop around and try again
> else:
> break
>
> But the code above is quadratic! If the headers are N bytes long, then
> on each pass through the loop we perform an O(N) regex search, and we
> do O(N) passes through the loop, so the whole thing is O(N**2). That
> means your HTTP client-or-server can be trivially DoSed by a peer who
> sends their headers broken into lots of small fragments.

Quadratic in the size of the headers only, so you could just cap it -
if the receive buffer is too large, just reject it. Sometimes, the
simplest approach is the best.
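
For example (a sketch; the limit and the exception are arbitrary):

    MAX_HEADER_SIZE = 64 * 1024

    if len(receive_buffer) > MAX_HEADER_SIZE:
        raise ValueError("header section too large")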

ChrisA


Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Chris Angelico
On Sun, Oct 7, 2018 at 9:54 AM Nathaniel Smith  wrote:
>
> On Sat, Oct 6, 2018 at 2:04 PM, Chris Angelico  wrote:
> > On Sun, Oct 7, 2018 at 8:01 AM Nathaniel Smith  wrote:
> >>
> >> On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum  wrote:
> >> > I'd like to use the re module to parse a long text file, 1GB in size. I 
> >> > wish
> >> > that the re module could parse a stream, so I wouldn't have to load the
> >> > whole thing into memory. I'd like to iterate over matches from the stream
> >> > without keeping the old matches and input in RAM.
> >> >
> >> > What do you think?
> >>
> >> This has frustrated me too.
> >>
> >> The case where I've encountered this is parsing HTTP/1.1. We have data
> >> coming in incrementally over the network, and we want to find the end
> >> of the headers. To do this, we're looking for the first occurrence of
> >> b"\r\n\r\n" OR b"\n\n".
> >>
> >> So our requirements are:
> >>
> >> 1. Search a bytearray for the regex b"\r\n\r\n|\n\n"
> >> 2. If there's no match yet, wait for more data to arrive and try again
> >> 3. When more data arrives, start searching again *where the last
> >> search left off*
> >>
> >> The last requirement is subtle, but important. The naive approach
> >> would be to rescan your whole receive buffer after each new packet
> >> arrives:
> >>
> >> end_of_headers = re.compile(b"\r\n\r\n|\n\n")
> >> while True:
> >> m = end_of_headers.search(receive_buffer)
> >> if m is None:
> >> receive_buffer += await get_more_data_from_network()
> >> # loop around and try again
> >> else:
> >> break
> >>
> >> But the code above is quadratic! If the headers are N bytes long, then
> >> on each pass through the loop we perform an O(N) regex search, and we
> >> do O(N) passes through the loop, so the whole thing is O(N**2). That
> >> means your HTTP client-or-server can be trivially DoSed by a peer who
> >> sends their headers broken into lots of small fragments.
> >
> > Quadratic in the size of the headers only, so you could just cap it -
> > if the receive buffer is too large, just reject it. Sometimes, the
> > simplest approach is the best.
>
> But OTOH, every problem has a solution that's simple, obvious, and wrong :-).
>
> Of course you set a cap on the header size, to prevent other kinds of
> DoS (e.g. memory exhaustion). But it turns out people stuff a lot of
> data into HTTP headers [1], so if the cap is large enough to support
> non-malicious usage, then it's also large enough to let people DoS the
> naive O(N**2) algorithm. Production-quality HTTP/1.1 parsers really do
> have to use an O(N) algorithm here.

Fair enough! You could instead use a simple linear parser rather than
searching for end of headers first. That way, all you're looking for
is a \n, no regex needed. Sometimes the VERY simplest solution doesn't
work, but I still do like to keep things as simple as possible :)
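
Something like this rough sketch, assuming a buffered file-like object such
as the one socket.makefile('rb') returns (the names and limits are made up):

    def read_header_lines(stream, max_size=64 * 1024):
        # Read one line at a time until the blank line that ends the headers.
        headers = []
        total = 0
        for raw_line in stream:
            total += len(raw_line)
            if total > max_size:
                raise ValueError("header section too large")
            if raw_line in (b"\r\n", b"\n"):
                return headers
            headers.append(raw_line.rstrip(b"\r\n"))
        raise ValueError("connection closed before end of headers")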

ChrisA


Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Nathaniel Smith
On Sat, Oct 6, 2018 at 2:04 PM, Chris Angelico  wrote:
> On Sun, Oct 7, 2018 at 8:01 AM Nathaniel Smith  wrote:
>>
>> On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum  wrote:
>> > I'd like to use the re module to parse a long text file, 1GB in size. I 
>> > wish
>> > that the re module could parse a stream, so I wouldn't have to load the
>> > whole thing into memory. I'd like to iterate over matches from the stream
>> > without keeping the old matches and input in RAM.
>> >
>> > What do you think?
>>
>> This has frustrated me too.
>>
>> The case where I've encountered this is parsing HTTP/1.1. We have data
>> coming in incrementally over the network, and we want to find the end
>> of the headers. To do this, we're looking for the first occurrence of
>> b"\r\n\r\n" OR b"\n\n".
>>
>> So our requirements are:
>>
>> 1. Search a bytearray for the regex b"\r\n\r\n|\n\n"
>> 2. If there's no match yet, wait for more data to arrive and try again
>> 3. When more data arrives, start searching again *where the last
>> search left off*
>>
>> The last requirement is subtle, but important. The naive approach
>> would be to rescan your whole receive buffer after each new packet
>> arrives:
>>
>> end_of_headers = re.compile(b"\r\n\r\n|\n\n")
>> while True:
>> m = end_of_headers.search(receive_buffer)
>> if m is None:
>> receive_buffer += await get_more_data_from_network()
>> # loop around and try again
>> else:
>> break
>>
>> But the code above is quadratic! If the headers are N bytes long, then
>> on each pass through the loop we perform an O(N) regex search, and we
>> do O(N) passes through the loop, so the whole thing is O(N**2). That
>> means your HTTP client-or-server can be trivially DoSed by a peer who
>> sends their headers broken into lots of small fragments.
>
> Quadratic in the size of the headers only, so you could just cap it -
> if the receive buffer is too large, just reject it. Sometimes, the
> simplest approach is the best.

But OTOH, every problem has a solution that's simple, obvious, and wrong :-).

Of course you set a cap on the header size, to prevent other kinds of
DoS (e.g. memory exhaustion). But it turns out people stuff a lot of
data into HTTP headers [1], so if the cap is large enough to support
non-malicious usage, then it's also large enough to let people DoS the
naive O(N**2) algorithm. Production-quality HTTP/1.1 parsers really do
have to use an O(N) algorithm here.

And similarly, if you're building a generic helper library for people
implementing arbitrary unknown protocols, then you can't assume their
protocols were designed to use small frames only, to avoid hitting
arbitrary limitations in Python's re module.

-n

[1] E.g., 16 KiB total header size is already enough that on my
laptop, the naive O(N**2) algorithm takes ~750 ms CPU time, versus ~16
ms for the O(N) algorithm. HTTP/2 tried hard to simplify their header
encoding scheme by putting limits on header size, but got so much
push-back that they were eventually forced to add special hacks to
allow for arbitrarily large headers – in particular, it turns out that
people use *individual cookies* that are larger than 16 KiB:
https://http2.github.io/faq/#why-the-rules-around-continuation-on-headers-frames

-- 
Nathaniel J. Smith -- https://vorpus.org


Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Steven D'Aprano
On Sat, Oct 06, 2018 at 02:00:27PM -0700, Nathaniel Smith wrote:

> Fortunately, there's an elegant and natural solution: Just save the
> regex engine's internal state when it hits the end of the string, and
> then when more data arrives, use the saved state to pick up the search
> where we left off. Theoretically, any regex engine *could* support
> this – it's especially obvious for DFA-based matchers, but even
> backtrackers like Python's re could support it, basically by making
> the matching engine a coroutine that can suspend itself when it hits
> the end of the input, then resume it when new input arrives. Like, if
> you asked Knuth for the theoretically optimal design for this parser,
> I'm pretty sure this is what he'd tell you to use, and it's what
> people do when writing high-performance HTTP parsers in C.

The message I take from this is:

- regex engines certainly can be written to support streaming data;
- but few of them are;
- and it is exceedingly unlikely to be able to easily (or at all) 
  retro-fit that support to Python's existing re module.

Perhaps the solution is a lightweight streaming DFA regex parser?

Does anyone know whether MRAB's regex library supports this?

https://pypi.org/project/regex/
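
For what it's worth, I believe regex does have *partial match* support (a
partial=True keyword to its matching methods), which at least lets you ask
"could this still match if more data arrived?", even though it doesn't save
the engine's state between calls. A hedged sketch:

    import regex

    end_of_headers = regex.compile(br"\r\n\r\n|\n\n")

    def check(buf):
        # With partial=True, a match that runs into the end of the buffer
        # and could still grow is reported with m.partial == True.
        m = end_of_headers.search(buf, partial=True)
        if m is None:
            return None        # no complete or pending match in this buffer
        if m.partial:
            return "partial"   # need more data before we can decide
        return m               # a real, complete match

Each call still rescans the buffer from the start, so by itself this doesn't
avoid the quadratic behaviour Nathaniel describes above.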


> you can't write efficient
> character-by-character algorithms in Python

I'm sure that Python will never be as efficient as C in that regard 
(although PyPy might argue the point) but is there something we can do 
to ameliorate this? If we could make char-by-char processing only 10 
times less efficient than C instead of 100 times (let's say...) perhaps 
that would help Ram (and you?) with your use-cases?


-- 
Steve


Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Ram Rachum
Hi Ned! I'm happy to see you here.

I'm doing multi-color 3d-printing. The slicing software generates a GCode
file, which is a text file of instructions for the printer, each command
meaning something like "move the head to coordinates x,y,z while extruding
plastic at a rate of w" and lots of other administrative commands. (Turn
the print fan on, heat the extruder to a certain temperature, calibrate the
printer, etc.)

Here's an example of a simple GCode from a print I did last week:
https://www.dropbox.com/s/kzmm6v8ilcn0aik/JPL%20Go%20hook.gcode?dl=0

It's 1.8MB in size. They could get to 1GB for complex prints.

Multi-color prints mean that at some points in the print, usually at a
layer change, I'm changing the color. This means I need to insert an M600
command, which tells the printer to pause the print, move the head around,
and give me a prompt to change the filament before continuing printing.

I'm sick of injecting the M600 manually after every print. I've been doing
that for the last year. I'm working on a script so I could say "Insert an
M600 command after layers 5, 88 and 234, and also before process Foo."

The slicer inserts comments saying "; layer 234" Or "; process Foo". I want
to identify the entire layer as one match. That's because I want to find
the layer and possibly process at the start, I want to find the last
retraction command, the first extrusion command in the new layer, etc. So
it's a regex that spans potentially thousands of lines.

Then I'll know just where to put my M600 and how much retraction to do
afterwards.


Thanks,
Ram.


On Sat, Oct 6, 2018 at 6:58 PM Ned Batchelder  wrote:

> On 10/6/18 7:25 AM, Ram Rachum wrote:
>
> "This is a regular expression problem, rather than a Python problem."
>
> Do you have evidence for this assertion, except that other regex
> implementations have this limitation? Is there a regex specification
> somewhere that specifies that streams aren't supported? Is there a
> fundamental reason that streams aren't supported?
>
>
> "Can the lexing be done on a line-by-line basis?"
>
> For my use case, it unfortunately can't.
>
>
> You mentioned earlier that your use case doesn't have to worry about the
> "a.*b" problem.  Can you tell us more about your scenario?  How would the
> stream know it had read enough to match or not match?  Perhaps that same
> logic can be used to feed the data in chunks?
>
> --Ned.
>
>
> On Sat, Oct 6, 2018 at 1:53 PM Jonathan Fine  wrote:
>
>> Hi Ram
>>
>> You wrote:
>>
>> > I'd like to use the re module to parse a long text file, 1GB in size. I
>> > wish that the re module could parse a stream, so I wouldn't have to load
>> > the whole thing into memory. I'd like to iterate over matches from the
>> > stream without keeping the old matches and input in RAM.
>>
>> This is a regular expression problem, rather than a Python problem. A
>> search for
>> regular expression large file
>> brings up some URLs that might help you, starting with
>>
>> https://stackoverflow.com/questions/23773669/grep-pattern-match-between-very-large-files-is-way-too-slow
>>
>> This might also be helpful
>> https://svn.boost.org/trac10/ticket/11776
>>
>> What will work for your problem depends on the nature of the problem
>> you have. The simplest thing that might work is to iterate of the file
>> line-by-line, and use a regular expression to extract matches from
>> each line.
>>
>> In other words, something like (not tested)
>>
>>def helper(lines):
>>for line in lines:
>>yield from re.finditer(pattern, line)
>>
>> lines = open('my-big-file.txt')
>> for match in helper(lines):
>> # Do your stuff here
>>
>> Parsing is not the same as lexing, see
>> https://en.wikipedia.org/wiki/Lexical_analysis
>>
>> I suggest you use regular expressions ONLY for the lexing phase. If
>> you'd like further help, perhaps first ask yourself this. Can the
>> lexing be done on a line-by-line basis? And if not, why not?
>>
>> If line-by-line not possible, then you'll have to modify the helper.
>> At the end of each line, they'll be a residue / remainder, which
>> you'll have to bring into the next line. In other words, the helper
>> will have to record (and change) the state that exists at the end of
>> each line. A bit like the 'carry' that is used when doing long
>> addition.
>>
>> I hope this helps.
>>
>> --
>> Jonathan
>>
>>
>

Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Ram Rachum
On Sun, Oct 7, 2018 at 4:40 AM Steven D'Aprano  wrote:

> I'm sure that Python will never be as efficient as C in that regard
> (although PyPy might argue the point) but is there something we can do
> to ameliorate this? If we could make char-by-char processing only 10
> times less efficient than C instead of 100 times (let's say...) perhaps
> that would help Ram (and you?) with your use-cases?
>
>
Does that mean I'll have to write that character-by-character algorithm? I
could already do that now I guess, the speed doesn't matter for my use
case, but I'm trying to avoid writing an algorithm.


Re: [Python-ideas] Support parsing stream with `re`

2018-10-06 Thread Cameron Simpson

On 07Oct2018 07:30, Ram Rachum  wrote:

I'm doing multi-color 3d-printing. The slicing software generates a GCode
file, which is a text file of instructions for the printer, each command
meaning something like "move the head to coordinates x,y,z while extruding
plastic at a rate of w" and lots of other administrative commands. (Turn
the print fan on, heat the extruder to a certain temperature, calibrate the
printer, etc.)

Here's an example of a simple GCode from a print I did last week:
https://www.dropbox.com/s/kzmm6v8ilcn0aik/JPL%20Go%20hook.gcode?dl=0

It's 1.8MB in size. They could get to 1GB for complex prints.

Multi-color prints means that at some points in the print, usually in a
layer change, I'm changing the color. This means I need to insert an M600
command, which tells the printer to pause the print, move the head around,
and give me a prompt to change the filament before continuing printing.

I'm sick of injecting the M600 manually after every print. I've been doing
that for the last year. I'm working on a script so I could say "Insert an
M600 command after layers 5, 88 and 234, and also before process Foo."

The slicer inserts comments saying "; layer 234" Or "; process Foo". I want
to identify the entire layer as one match. That's because I want to find
the layer and possibly process at the start, I want to find the last
retraction command, the first extrusion command in the new layer, etc. So
it's a regex that spans potentially thousands of lines.

Then I'll know just where to put my M600 and how much retraction to do
afterwards.


Aha.

Yeah, don't use a regexp for "the whole layer". I've fetched your file, and it 
is one instruction or comment per line. This is _easy_ to parse. Consider this 
totally untested sketch:


  layer_re = re.compile('^; layer (\d+), Z = (.*)')
  with open("JPL.gcode") as gcode:
      current_layer = None
      for lineno, line in enumerate(gcode, 1):
          m = layer_re.match(line)
          if m:
              # new layer
              new_layer = int(m.group(1))
              new_z = float(m.group(2))
              if current_layer is not None:
                  # process the saved previous layer
                  ..
              current_layer = new_layer
              accrued_layer = []
          if current_layer is not None:
              # we're saving lines for later work
              accrued_layer.append(line)
              continue
          # otherwise, copy the line straight out
          sys.stdout.write(line)

The idea is that you scan the data on a per-line basis, adjusting some state
variables as you see important lines.  If you're "saving" a chunk of lines such
as the instructions in a layer (in the above code: "current_layer is not None")
you can stuff just those lines into a list for use when complete.

On changes of state you deal with what you may have saved, etc.

But just looking at your examples, you may not need to save anything; just 
insert or append lines during the copy. Example:


  with open("JPL.gcode") as gcode:
      for lineno, line in enumerate(gcode, 1):
          # pre line actions
          if line.startswith('; process '):
              print("M600 instruction...")
          # copy out the line
          sys.stdout.write(line)
          # post line actions
          if ...

So you don't need to apply a regexp to a huge chunk of file. Process the file 
on an instruction basis and insert/append your extra instructions as you see 
the boundaries of the code you're after.


A minor note. This incantation:

 for lineno, line in enumerate(gcode, 1):

is to make it easy to print error messages which recite the file line number to
aid debugging. If you don't need that you'd just run with:


 for line in gcode:

Cheers,
Cameron Simpson 


Re: [Python-ideas] Support parsing stream with `re`

2018-10-07 Thread Cameron Simpson

On 07Oct2018 07:32, Ram Rachum  wrote:

On Sun, Oct 7, 2018 at 4:40 AM Steven D'Aprano  wrote:

I'm sure that Python will never be as efficient as C in that regard
(although PyPy might argue the point) but is there something we can do
to ameliorate this? If we could make char-by-char processing only 10
times less efficient than C instead of 100 times (let's say...) perhaps
that would help Ram (and you?) with your use-cases?


Does that mean I'll have to write that character-by-character algorithm? I
could already do that now I guess, the speed doesn't matter for my use
case, but I'm trying to avoid writing an algorithm.


No. You can write a line by line loop. See my other post.

Cheers,
Cameron Simpson 


Re: [Python-ideas] Support parsing stream with `re`

2018-10-07 Thread Nathaniel Smith
On Sat, Oct 6, 2018, 18:40 Steven D'Aprano  wrote:

> The message I take from this is:
>
> - regex engines certainly can be written to support streaming data;
> - but few of them are;
> - and it is exceedingly unlikely to be able to easily (or at all)
>   retro-fit that support to Python's existing re module.
>

I don't know enough about the re module internals to make an informed guess
about the difficulty.

On a quick glance, it does seem to store most intermediate match state in
explicit structs rather than on the call stack...

-n

>


Re: [Python-ideas] Support parsing stream with `re`

2018-10-07 Thread Ram Rachum
Hi Cameron,

Thanks for putting in the time to study my problem and sketch a solution.

Unfortunately, it's not helpful. I was developing a solution similar to
yours before I came to the conclusion that a multiline regex would be more
elegant. I find this algorithm to be quite complicated. It's basically a
poor man's regex engine.

I'm more likely to use a shim to make the re package work on streams (like
regexy or reading chunks until I get a match) than to use an algorithm like
that.


Thanks,
Ram.

On Sun, Oct 7, 2018 at 9:58 AM Cameron Simpson  wrote:

> On 07Oct2018 07:30, Ram Rachum  wrote:
> >I'm doing multi-color 3d-printing. The slicing software generates a GCode
> >file, which is a text file of instructions for the printer, each command
> >meaning something like "move the head to coordinates x,y,z while extruding
> >plastic at a rate of w" and lots of other administrative commands. (Turn
> >the print fan on, heat the extruder to a certain temperature, calibrate
> the
> >printer, etc.)
> >
> >Here's an example of a simple GCode from a print I did last week:
> >https://www.dropbox.com/s/kzmm6v8ilcn0aik/JPL%20Go%20hook.gcode?dl=0
> >
> >It's 1.8MB in size. They could get to 1GB for complex prints.
> >
> >Multi-color prints means that at some points in the print, usually in a
> >layer change, I'm changing the color. This means I need to insert an M600
> >command, which tells the printer to pause the print, move the head around,
> >and give me a prompt to change the filament before continuing printing.
> >
> >I'm sick of injecting the M600 manually after every print. I've been doing
> >that for the last year. I'm working on a script so I could say "Insert an
> >M600 command after layers 5, 88 and 234, and also before process Foo."
> >
> >The slicer inserts comments saying "; layer 234" Or "; process Foo". I
> want
> >to identify the entire layer as one match. That's because I want to find
> >the layer and possibly process at the start, I want to find the last
> >retraction command, the first extrusion command in the new layer, etc. So
> >it's a regex that spans potentially thousands of lines.
> >
> >Then I'll know just where to put my M600 and how much retraction to do
> >afterwards.
>
> Aha.
>
> Yeah, don't use a regexp for "the whole layer". I've fetched your file,
> and it
> is one instruction or comment per line. This is _easy_ to parse. Consider
> this
> totally untested sketch:
>
>   layer_re = re.compile('^; layer (\d+), Z = (.*)')
>   with open("JPL.gcode") as gcode:
> current_layer = None
> for lineno, line in enumerate(gcode, 1):
>   m = layer_re.match(line)
>   if m:
> # new layer
> new_layer = int(m.group(1))
> new_z = float(m.group(2))
> if current_layer is not None:
>   # process the saved previous layer
>   ..
> current_layer = new_layer
> accrued_layer = []
>   if current_layer is not None:
> # we're saving lines for later work
> accrued_layer.append(line)
> continue
>   # otherwise, copy the line straight out
>   sys.stdout.write(line)
>
> The idea is that you scan the data on a per-line basis, adjusting some
> state
> variables as you see important lines.  If you're "saving" a chunk of lines
> such
> as the instructions in a layer (in the above code: "current_layer is not
> None")
> you can stuff just those lines into a list for use when complete.
>
> On changes of state you deal with what you may have saved, etc.
>
> But just looking at your examples, you may not need to save anything; just
> insert or append lines during the copy. Example:
>
>   with open("JPL.gcode") as gcode:
> for lineno, line in enumerate(gcode, 1):
>   # pre line actions
>   if line.startswith('; process '):
> print("M600 instruction...")
>   # copy out the line
>   sys.stdout.write(line)
>   # post line actions
>   if ...
>
> So you don't need to apply a regexp to a huge chunk of file. Process the
> file
> on an instruction basis and insert/append your extra instructions as you
> see
> the boundaries of the code you're after.
>
> A minor note. This incantation:
>
>   for lineno, line in enumerate(gcode, 1):
>
> is to make it easy to print error message which recite the file line
> number to
> aid debugging. If you don't need that you'd just run with:
>
>   for line in gcode:
>
> Cheers,
> Cameron Simpson 
>
>


Re: [Python-ideas] Support parsing stream with `re`

2018-10-07 Thread 2015

On 18-10-07 15.11, Ram Rachum wrote:

> Unfortunately, it's not helpful. I was developing a solution similar 
to yours before I came to the conclusion that a multilne regex would be 
more elegant.



How about memory mapping your 1GB file?

bytes patterns work on memoryviews.

regards, Anders




Re: [Python-ideas] Support parsing stream with `re`

2018-10-07 Thread Ram Rachum
I tested it now and indeed bytes patterns work on memoryview objects. But
how do I use this to scan for patterns through a stream without loading it
to memory?

On Sun, Oct 7, 2018 at 4:24 PM <2...@jmunch.dk> wrote:

> On 18-10-07 15.11, Ram Rachum wrote:
>
>  > Unfortunately, it's not helpful. I was developing a solution similar
> to yours before I came to the conclusion that a multilne regex would be
> more elegant.
>
>
> How about memory mapping your 1GB file?
>
> bytes patterns work on memoryviews.
>
> regards, Anders
>
>
>
>


Re: [Python-ideas] Support parsing stream with `re`

2018-10-07 Thread 2015

On 18-10-07 16.15, Ram Rachum wrote:
> I tested it now and indeed bytes patterns work on memoryview objects.
> But how do I use this to scan for patterns through a stream without
> loading it to memory?

An mmap object is one of the things you can make a memoryview of,
although looking again, it seems you don't even need to, you can
just re.search the mmap object directly.

re.search'ing the mmap object means the operating system takes care of
the streaming for you, reading in parts of the file only as necessary.
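
i.e. something along these lines (a sketch; the file name and pattern are
placeholders):

    import mmap
    import re

    with open("huge_file.gcode", "rb") as f:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # A bytes pattern can be applied to the mmap object directly;
            # the OS pages parts of the file in as the regex engine walks it.
            for m in re.finditer(rb"^; layer \d+.*$", mapped, re.MULTILINE):
                print(m.group(0).decode())
        finally:
            mapped.close()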

regards, Anders



Re: [Python-ideas] Support parsing stream with `re`

2018-10-07 Thread Jonathan Fine
Anders wrote

> An mmap object is one of the things you can make a memoryview of,
> although looking again, it seems you don't even need to, you can
> just re.search the mmap object directly.
>
> re.search'ing the mmap object means the operating system takes care of
> the streaming for you, reading in parts of the file only as necessary.

Provided mmap releases memory when possible, this is a good solution,
and probably in many cases better than the one I suggested. And closer
to what the original poster asked for.

So I've learnt something. Thank you all.

-- 
Jonathan


Re: [Python-ideas] Support parsing stream with `re`

2018-10-07 Thread Greg Ewing

Jonathan Fine wrote:


Provided mmap releases memory when possible,


It will. The virtual memory system will read pages from the
file into RAM when needed, and re-use those RAM pages for other
purposes when needed. It should be pretty much the most
efficient solution possible.

--
Greg


Re: [Python-ideas] Support parsing stream with `re`

2018-10-07 Thread Terry Reedy

On 10/6/2018 5:00 PM, Nathaniel Smith wrote:

On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum  wrote:

I'd like to use the re module to parse a long text file, 1GB in size. I wish
that the re module could parse a stream, so I wouldn't have to load the
whole thing into memory. I'd like to iterate over matches from the stream
without keeping the old matches and input in RAM.

What do you think?


This has frustrated me too.

The case where I've encountered this is parsing HTTP/1.1. We have data
coming in incrementally over the network, and we want to find the end
of the headers. To do this, we're looking for the first occurrence of
b"\r\n\r\n" OR b"\n\n".

So our requirements are:

1. Search a bytearray for the regex b"\r\n\r\n|\n\n"


I believe that re is both overkill and slow for this particular problem.
For O(n), search forward for \n with str.index('\n') (or .find)
[I assume that this searches forward faster than
for i, c in enumerate(s):
   if c == '\n': break
and leave you to test this.]

If not found, continue with the next chunk of data.
If found, look back for \r to determine whether to look forward for \n
or \r\n, whenever there is enough data to do so.



2. If there's no match yet, wait for more data to arrive and try again
3. When more data arrives, start searching again *where the last
search left off*


s.index has an optional start parameter.  And keep chunks in a list 
until you have a match and can join all at once.
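
A rough sketch of that shape (get_more_data() is a stand-in for however the
next chunk arrives; for brevity it grows a single bytes buffer rather than
keeping a list of chunks and joining at the end):

    def find_blank_line(get_more_data):
        # Return the index of the \n that ends the first "\n\n" or
        # "\r\n\r\n", reading more data as needed and never rescanning.
        buf = b""
        search_from = 0
        while True:
            idx = buf.find(b"\n", search_from)
            if idx == -1:
                search_from = len(buf)   # no \n in what we have so far
                chunk = get_more_data()
                if not chunk:
                    return -1
                buf += chunk
                continue
            if buf[idx - 1:idx] == b"\n" or buf[idx - 3:idx] == b"\r\n\r":
                return idx
            search_from = idx + 1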



--
Terry Jan Reedy



Re: [Python-ideas] Support parsing stream with `re`

2018-10-07 Thread Terry Reedy

On 10/7/2018 12:32 AM, Ram Rachum wrote:

Does that mean I'll have to write that character-by-character algorithm? 


I would not be surprised if you could make use of str.index, which scans 
at C speed.  See my answer to Nathaniel.



--
Terry Jan Reedy



Re: [Python-ideas] Support parsing stream with `re`

2018-10-07 Thread Nathaniel Smith
On Sun, Oct 7, 2018 at 5:09 PM, Terry Reedy  wrote:
> On 10/6/2018 5:00 PM, Nathaniel Smith wrote:
>>
>> On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum  wrote:
>>>
>>> I'd like to use the re module to parse a long text file, 1GB in size. I
>>> wish
>>> that the re module could parse a stream, so I wouldn't have to load the
>>> whole thing into memory. I'd like to iterate over matches from the stream
>>> without keeping the old matches and input in RAM.
>>>
>>> What do you think?
>>
>>
>> This has frustrated me too.
>>
>> The case where I've encountered this is parsing HTTP/1.1. We have data
>> coming in incrementally over the network, and we want to find the end
>> of the headers. To do this, we're looking for the first occurrence of
>> b"\r\n\r\n" OR b"\n\n".
>>
>> So our requirements are:
>>
>> 1. Search a bytearray for the regex b"\r\n\r\n|\n\n"
>
>
> I believe that re is both overkill and slow for this particular problem.
> For O(n), search forward for \n with str.index('\n') (or .find)
> [I assume that this searches forward faster than
> for i, c in enumerate(s):
>if c == '\n': break
> and leave you to test this.]
>
> If not found, continue with next chunk of data.
> If found, look back for \r to determine whether to look forward for \n or
> \r\n *whenever there is enough data to do so.

Are you imagining something roughly like this? (Ignoring chunk
boundary handling for the moment.)

def find_double_line_end(buf):
    start = 0
    while True:
        next_idx = buf.index(b"\n", start)
        if (buf[next_idx - 1:next_idx + 1] == b"\n\n"
                or buf[next_idx - 3:next_idx] == b"\r\n\r"):
            return next_idx
        start = next_idx + 1

That's much more complicated than using re.search, and on some random
HTTP headers I have lying around it benchmarks ~70% slower too. Which
makes sense, since we're basically trying to replicate re engine's
work by hand in a slower language.

BTW, if we only want to find a fixed string like b"\r\n\r\n", then
re.search and bytearray.index are almost identical in speed. If you
have a problem that can be expressed as a regular expression, then
regular expression engines are actually pretty good at solving those
:-)

-n

-- 
Nathaniel J. Smith -- https://vorpus.org


Re: [Python-ideas] Support parsing stream with `re`

2018-10-07 Thread Nathaniel Smith
On Sun, Oct 7, 2018 at 5:54 PM, Nathaniel Smith  wrote:
> Are you imagining something roughly like this? (Ignoring chunk
> boundary handling for the moment.)
>
> def find_double_line_end(buf):
> start = 0
> while True:
> next_idx = buf.index(b"\n", start)
> if buf[next_idx - 1:next_idx + 1] == b"\n" or buf[next_idx -
> 3:next_idx] == b"\r\n\r":
> return next_idx
> start = next_idx + 1
>
> That's much more complicated than using re.search, and on some random
> HTTP headers I have lying around it benchmarks ~70% slower too. Which
> makes sense, since we're basically trying to replicate re engine's
> work by hand in a slower language.
>
> BTW, if we only want to find a fixed string like b"\r\n\r\n", then
> re.search and bytearray.index are almost identical in speed. If you
> have a problem that can be expressed as a regular expression, then
> regular expression engines are actually pretty good at solving those
> :-)

Though... here's something strange.

Here's another way to search for the first appearance of either
\r\n\r\n or \n\n in a bytearray:

def find_double_line_end_2(buf):
    idx1 = buf.find(b"\r\n\r\n")
    idx2 = buf.find(b"\n\n", 0, idx1)
    if idx1 == -1:
        return idx2
    elif idx2 == -1:
        return idx1
    else:
        return min(idx1, idx2)

So this is essentially equivalent to our regex (notice they both pick
out position 505 as the end of the headers):

In [52]: find_double_line_end_2(sample_headers)
Out[52]: 505

In [53]: double_line_end_re = re.compile(b"\r\n\r\n|\n\n")

In [54]: double_line_end_re.search(sample_headers)
Out[54]: <_sre.SRE_Match object; span=(505, 509), match=b'\r\n\r\n'>

But, the Python function that calls bytearray.find twice is about ~3x
faster than the re module:

In [55]: %timeit find_double_line_end_2(sample_headers)
1.18 µs ± 40 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [56]: %timeit double_line_end_re.search(sample_headers)
3.3 µs ± 23.9 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)

The regex module is even slower:

In [57]: double_line_end_regex = regex.compile(b"\r\n\r\n|\n\n")

In [58]: %timeit double_line_end_regex.search(sample_headers)
4.95 µs ± 76.4 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)

-n

-- 
Nathaniel J. Smith -- https://vorpus.org


Re: [Python-ideas] Support parsing stream with `re`

2018-10-08 Thread Ram Rachum
That's incredibly interesting. I've never used mmap before.

However, there's a problem.

I did a few experiments with mmap now, this is the latest:

path = pathlib.Path(r'P:\huge_file')

with path.open('r') as file:
    mmap = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
    for match in re.finditer(b'.', mmap):
        pass

The file is 338GB in size, and it seems that Python is trying to load it
into memory. The process is now taking 4GB RAM and it's growing. I saw the
same behavior when searching for a non-existing match.

Should I open a Python bug for this?

On Sun, Oct 7, 2018 at 7:49 PM <2...@jmunch.dk> wrote:

> On 18-10-07 16.15, Ram Rachum wrote:
>  > I tested it now and indeed bytes patterns work on memoryview objects.
>  > But how do I use this to scan for patterns through a stream without
>  > loading it to memory?
>
> An mmap object is one of the things you can make a memoryview of,
> although looking again, it seems you don't even need to, you can
> just re.search the mmap object directly.
>
> re.search'ing the mmap object means the operating system takes care of
> the streaming for you, reading in parts of the file only as necessary.
>
> regards, Anders
>
>
>


Re: [Python-ideas] Support parsing stream with `re`

2018-10-08 Thread Cameron Simpson

On 08Oct2018 10:56, Ram Rachum  wrote:

That's incredibly interesting. I've never used mmap before.
However, there's a problem.
I did a few experiments with mmap now, this is the latest:

path = pathlib.Path(r'P:\huge_file')

with path.open('r') as file:
   mmap = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)


Just a remark: don't tromp on the "mmap" name. Maybe "mapped"?


   for match in re.finditer(b'.', mmap):
       pass

The file is 338GB in size, and it seems that Python is trying to load it
into memory. The process is now taking 4GB RAM and it's growing. I saw the
same behavior when searching for a non-existing match.

Should I open a Python bug for this?


Probably not. First figure out what is going on. BTW, how much RAM have you 
got?


As you access the mapped file the OS will try to keep it in memory in case you 
need that again. In the absence of competition, most stuff will get paged out
to accommodate it. That's normal. All the data are "clean" (unmodified) so the
OS can simply release the older pages instantly if something else needs the 
RAM.


However, another possibility is that the regexp is consuming lots of memory.

The regexp seems simple enough (b'.'), so I doubt it is leaking memory like 
mad; I'm guessing you're just seeing the OS page in as much of the file as it 
can.


Also, does the loop iterate? i.e. does it find multiple matches as the memory 
gets consumed, or is the first iteration blocking and consuming gobs of memory
before the first match comes back? A print() call will tell you that.


Cheers,
Cameron Simpson 


Re: [Python-ideas] Support parsing stream with `re`

2018-10-08 Thread Ram Rachum
I'm not an expert on memory. I used Process Explorer to look at the
Process. The Working Set of the current run is 11GB. The Private Bytes is
708MB. Actually, see all the info here:
https://www.dropbox.com/s/tzoud028pzdkfi7/screenshot_TURING_2018-10-08_133355.jpg?dl=0

I've got 16GB of RAM on this computer, and Process Explorer says it's
almost full, just ~150MB left. This is physical memory.

To your question: The loop does iterate, i.e. finding multiple matches.

On Mon, Oct 8, 2018 at 1:20 PM Cameron Simpson  wrote:

> On 08Oct2018 10:56, Ram Rachum  wrote:
> >That's incredibly interesting. I've never used mmap before.
> >However, there's a problem.
> >I did a few experiments with mmap now, this is the latest:
> >
> >path = pathlib.Path(r'P:\huge_file')
> >
> >with path.open('r') as file:
> >mmap = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
>
> Just a remark: don't tromp on the "mmap" name. Maybe "mapped"?
>
> >for match in re.finditer(b'.', mmap):
> >pass
> >
> >The file is 338GB in size, and it seems that Python is trying to load it
> >into memory. The process is now taking 4GB RAM and it's growing. I saw the
> >same behavior when searching for a non-existing match.
> >
> >Should I open a Python bug for this?
>
> Probably not. First figure out what is going on. BTW, how much RAM have
> you
> got?
>
> As you access the mapped file the OS will try to keep it in memory in case
> you
> need that again. In the absense of competition, most stuff will get paged
> out
> to accomodate it. That's normal. All the data are "clean" (unmodified) so
> the
> OS can simply release the older pages instantly if something else needs
> the
> RAM.
>
> However, another possibility is the the regexp is consuming lots of memory.
>
> The regexp seems simple enough (b'.'), so I doubt it is leaking memory
> like
> mad; I'm guessing you're just seeing the OS page in as much of the file as
> it
> can.
>
> Also, does the loop iterate? i.e. does it find multiple matches as the
> memory
> gets consumed, or is the first iateration blocking and consuming gobs of
> memory
> before the first match comes back? A print() call will tell you that.
>
> Cheers,
> Cameron Simpson 
>
>


Re: [Python-ideas] Support parsing stream with `re`

2018-10-08 Thread Erik Bray
On Mon, Oct 8, 2018 at 12:20 PM Cameron Simpson  wrote:
>
> On 08Oct2018 10:56, Ram Rachum  wrote:
> >That's incredibly interesting. I've never used mmap before.
> >However, there's a problem.
> >I did a few experiments with mmap now, this is the latest:
> >
> >path = pathlib.Path(r'P:\huge_file')
> >
> >with path.open('r') as file:
> >mmap = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
>
> Just a remark: don't tromp on the "mmap" name. Maybe "mapped"?
>
> >for match in re.finditer(b'.', mmap):
> >pass
> >
> >The file is 338GB in size, and it seems that Python is trying to load it
> >into memory. The process is now taking 4GB RAM and it's growing. I saw the
> >same behavior when searching for a non-existing match.
> >
> >Should I open a Python bug for this?
>
> Probably not. First figure out what is going on. BTW, how much RAM have you
> got?
>
> As you access the mapped file the OS will try to keep it in memory in case you
> need that again. In the absense of competition, most stuff will get paged out
> to accomodate it. That's normal. All the data are "clean" (unmodified) so the
> OS can simply release the older pages instantly if something else needs the
> RAM.
>
> However, another possibility is the the regexp is consuming lots of memory.
>
> The regexp seems simple enough (b'.'), so I doubt it is leaking memory like
> mad; I'm guessing you're just seeing the OS page in as much of the file as it
> can.

Yup. Windows will aggressively fill up your RAM in cases like this
because after all why not?  There's no use to having memory just
sitting around unused.  For read-only, non-anonymous mappings it's not
much problem for the OS to drop pages that haven't been recently
accessed and use them for something else.  So I wouldn't be too
worried about the process chewing up RAM.

I feel like this is veering more into python-list territory for
further discussion though.
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Support parsing stream with `re`

2018-10-08 Thread Ram Rachum
" Windows will aggressively fill up your RAM in cases like this because
after all why not?  There's no use to having memory just sitting around
unused."

Two questions:

1. Is the "why not" sarcastic, as in you're agreeing it's a waste?
2. Will this be different on Linux? Which command do I run on Linux to
verify that the process isn't taking too much RAM?


Thanks,
Ram.


On Mon, Oct 8, 2018 at 3:02 PM Erik Bray  wrote:

> On Mon, Oct 8, 2018 at 12:20 PM Cameron Simpson  wrote:
> >
> > On 08Oct2018 10:56, Ram Rachum  wrote:
> > >That's incredibly interesting. I've never used mmap before.
> > >However, there's a problem.
> > >I did a few experiments with mmap now, this is the latest:
> > >
> > >path = pathlib.Path(r'P:\huge_file')
> > >
> > >with path.open('r') as file:
> > >mmap = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
> >
> > Just a remark: don't tromp on the "mmap" name. Maybe "mapped"?
> >
> > >for match in re.finditer(b'.', mmap):
> > >pass
> > >
> > >The file is 338GB in size, and it seems that Python is trying to load it
> > >into memory. The process is now taking 4GB RAM and it's growing. I saw
> the
> > >same behavior when searching for a non-existing match.
> > >
> > >Should I open a Python bug for this?
> >
> > Probably not. First figure out what is going on. BTW, how much RAM have
> you
> > got?
> >
> > As you access the mapped file the OS will try to keep it in memory in
> case you
> > need that again. In the absense of competition, most stuff will get
> paged out
> > to accomodate it. That's normal. All the data are "clean" (unmodified)
> so the
> > OS can simply release the older pages instantly if something else needs
> the
> > RAM.
> >
> > However, another possibility is the the regexp is consuming lots of
> memory.
> >
> > The regexp seems simple enough (b'.'), so I doubt it is leaking memory
> like
> > mad; I'm guessing you're just seeing the OS page in as much of the file
> as it
> > can.
>
> Yup. Windows will aggressively fill up your RAM in cases like this
> because after all why not?  There's no use to having memory just
> sitting around unused.  For read-only, non-anonymous mappings it's not
> much problem for the OS to drop pages that haven't been recently
> accessed and use them for something else.  So I wouldn't be too
> worried about the process chewing up RAM.
>
> I feel like this is veering more into python-list territory for
> further discussion though.
>
>
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Support parsing stream with `re`

2018-10-08 Thread Anders Hovmöller

>> However, another possibility is the the regexp is consuming lots of memory.
>> 
>> The regexp seems simple enough (b'.'), so I doubt it is leaking memory like
>> mad; I'm guessing you're just seeing the OS page in as much of the file as it
>> can.
> 
> Yup. Windows will aggressively fill up your RAM in cases like this
> because after all why not?  There's no use to having memory just
> sitting around unused.  For read-only, non-anonymous mappings it's not
> much problem for the OS to drop pages that haven't been recently
> accessed and use them for something else.  So I wouldn't be too
> worried about the process chewing up RAM.
> 
> I feel like this is veering more into python-list territory for
> further discussion though.

Last time I worked on Windows, which admittedly was a long time ago, the file 
cache was not attributed to a process, so this doesn't seem to be relevant to 
this situation. 

/ Anders___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Support parsing stream with `re`

2018-10-08 Thread Richard Damon
On 10/8/18 8:11 AM, Ram Rachum wrote:
> " Windows will aggressively fill up your RAM in cases like this
> because after all why not?  There's no use to having memory just
> sitting around unused."
>
> Two questions:
>
> 1. Is the "why not" sarcastic, as in you're agreeing it's a waste?
> 2. Will this be different on Linux? Which command do I run on Linux to
> verify that the process isn't taking too much RAM?
>
>
> Thanks,
> Ram.
I would say the 'why not' isn't being sarcastic but pragmatic. (And I
would expect Linux to work similarly.) After all, if you have a system
with X amount of memory, and the total memory demand of the other processes
is 10% of X, what is the issue with letting one process use 80% of X for
memory that is easy to clear out if something else wants it? A read-only
page that is already backed on disk is trivial to make
available for another use.

Memory just sitting idle is the real waste.

-- 
Richard Damon

___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Support parsing stream with `re`

2018-10-08 Thread Ram Rachum
Thanks for your help everybody! I'm very happy to have learned about mmap.

On Mon, Oct 8, 2018 at 3:27 PM Richard Damon 
wrote:

> On 10/8/18 8:11 AM, Ram Rachum wrote:
> > " Windows will aggressively fill up your RAM in cases like this
> > because after all why not?  There's no use to having memory just
> > sitting around unused."
> >
> > Two questions:
> >
> > 1. Is the "why not" sarcastic, as in you're agreeing it's a waste?
> > 2. Will this be different on Linux? Which command do I run on Linux to
> > verify that the process isn't taking too much RAM?
> >
> >
> > Thanks,
> > Ram.
> I would say the 'why not' isn't being sarcastic but pragmatic. (And I
> would expect Linux to work similarly). After all if you have a system
> with X amount of memory, and total memory demand for the other processes
> is 10% of X, what is the issue with letting one process use 80% of X
> with memory usages that is easy to clear out if something else wants it.
> A read only page that is already backed on the disk is trivial to make
> available for another usage.
>
> Memory just sitting idle is the real waste.
>
> --
> Richard Damon
>
> ___
> Python-ideas mailing list
> Python-ideas@python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/
>
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Support parsing stream with `re`

2018-10-08 Thread Chris Angelico
On Mon, Oct 8, 2018 at 11:15 PM Anders Hovmöller  wrote:
>
>
> However, another possibility is the the regexp is consuming lots of memory.
>
> The regexp seems simple enough (b'.'), so I doubt it is leaking memory like
> mad; I'm guessing you're just seeing the OS page in as much of the file as it
> can.
>
>
> Yup. Windows will aggressively fill up your RAM in cases like this
> because after all why not?  There's no use to having memory just
> sitting around unused.  For read-only, non-anonymous mappings it's not
> much problem for the OS to drop pages that haven't been recently
> accessed and use them for something else.  So I wouldn't be too
> worried about the process chewing up RAM.
>
> I feel like this is veering more into python-list territory for
> further discussion though.
>
>
> Last time I worked on windows, which admittedly was a long time, the file 
> cache was not attributed to a process, so this doesn't seem to be relevant to 
> this situation.

Depends whether it's a file cache or a memory-mapped file, though. On
Linux, if I open a file, read it, then close it, I'm not using that
file any more, but it might remain in cache (which will mean that
re-reading it will be fast, regardless of whether that's from the same
or a different process). That usage shows up as either "buffers" or
"cache", and doesn't belong to any process.

In contrast, a mmap'd file is memory that you do indeed own. If the
system runs short of physical memory, it can simply discard those
pages (rather than saving them to the swap file), but they're still
owned by one specific process, and should count in that process's
virtual memory.

(That's based on my knowledge of Linux today and OS/2 back in the 90s.
It may or may not be accurate to Windows, but I suspect it won't be
very far wrong.)
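
One way to see that difference from Python (a rough sketch; 'my-big-file.bin'
is just an illustrative name and is assumed to exist and be non-empty):

import mmap

# Plain reads: only the small buffer below is charged to this process; any
# caching of the file's pages lives in the kernel's page cache ("buffers"/
# "cache"), which belongs to no process.
with open('my-big-file.bin', 'rb') as f:
    while True:
        chunk = f.read(1 << 20)
        if not chunk:
            break

# A mapping: the file's pages become part of this process's address space,
# so tools that report virtual size attribute them to the process, even
# though clean pages can be dropped by the OS at any time.
with open('my-big-file.bin', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        _ = mapped[:16]  # touching the mapping faults pages in on demand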

ChrisA
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Support parsing stream with `re`

2018-10-08 Thread cs

On 08Oct2018 13:36, Ram Rachum  wrote:

I'm not an expert on memory. I used Process Explorer to look at the
Process. The Working Set of the current run is 11GB. The Private Bytes is
708MB. Actually, see all the info here:
https://www.dropbox.com/s/tzoud028pzdkfi7/screenshot_TURING_2018-10-08_133355.jpg?dl=0


And the process's virtual size is about 353GB, which matches having your file 
mmapped (its contents are now part of your process's virtual memory space).



I've got 16GB of RAM on this computer, and Process Explorer says it's
almost full, just ~150MB left. This is physical memory.


I'd say this is expected behaviour. As you access the memory it is paged into 
physical memory, and since it may be wanted again (the OS can't tell) it isn't 
paged out until that becomes necessary to make room for other virtual pages.


I suspect (but would need to test to find out) that sequentially reading the 
file instead of memory mapping it might not be so aggressive because your 
process would be reusing that same small pool of memory to hold data as you 
scan the file.
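
A sketch of that approach (illustrative code, not from the thread; it reuses
one small buffer and assumes every match fits inside a single chunk, which
holds for b'.' but not for patterns that can span a chunk boundary):

import re

def finditer_chunked(pattern, path, chunk_size=1 << 20):
    # Scan the file with a reused 1 MiB buffer instead of mapping it all.
    # Note: match offsets are relative to the current chunk, not the file.
    regex = re.compile(pattern)
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield from regex.finditer(chunk)

count = sum(1 for _ in finditer_chunked(b'.', r'P:\huge_file'))
print(count, 'matches')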


Cheers,
Cameron Simpson 
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Support parsing stream with `re`

2018-10-09 Thread Greg Ewing

Chris Angelico wrote:

In contrast, a mmap'd file is memory that you do indeed own.


Although it's not really accurate to say that it's owned by
a particular process. If two processes mmap the same file,
the physical memory pages holding it appear in the address
spaces of both processes.

--
Greg


___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Support parsing stream with `re`

2018-10-09 Thread Chris Angelico
On Tue, Oct 9, 2018 at 10:05 PM Greg Ewing  wrote:
>
> Chris Angelico wrote:
> > In contrast, a mmap'd file is memory that you do indeed own.
>
> Although it's not really accurate to say that it's owned by
> a particular process. If two processes mmap the same file,
> the physical memory pages holding it appear in the address
> spaces of both processes.
>

Yeah, which is a constant problem when you ask "why am I out of
memory". Same thing happens if you have any other (non-mmap'd) pages
and then you fork, or if two processes are using the same executable
(though I think that's actually mmap again), etc, etc, etc. Tell me,
which process is responsible for libc being in memory? Other than,
like, all of them?

ChrisA
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Support parsing stream with `re`

2018-10-09 Thread Stephen J. Turnbull
Chris Angelico writes:
 > On Tue, Oct 9, 2018 at 10:05 PM Greg Ewing  
 > wrote:
 > >
 > > Chris Angelico wrote:
 > > > In contrast, a mmap'd file is memory that you do indeed own.
 > >
 > > Although it's not really accurate to say that it's owned by
 > > a particular process. If two processes mmap the same file,
 > > the physical memory pages holding it appear in the address
 > > spaces of both processes.

Subject to COW, I presume.  Probably in units smaller than the whole
file (per page?)

 > Tell me, which process is responsible for libc being in memory? 
 > Other than, like, all of them?

Yes.  Why would you want a different answer?

___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Support parsing stream with `re`

2018-10-09 Thread Chris Angelico
On Wed, Oct 10, 2018 at 2:42 AM Stephen J. Turnbull
 wrote:
>
> Chris Angelico writes:
>  > On Tue, Oct 9, 2018 at 10:05 PM Greg Ewing  
> wrote:
>  > >
>  > > Chris Angelico wrote:
>  > > > In contrast, a mmap'd file is memory that you do indeed own.
>  > >
>  > > Although it's not really accurate to say that it's owned by
>  > > a particular process. If two processes mmap the same file,
>  > > the physical memory pages holding it appear in the address
>  > > spaces of both processes.
>
> Subject to COW, I presume.  Probably in units smaller than the whole
> file (per page?)

Both processes are using the virtual memory. Either or both could be
using physical memory. Assuming they haven't written to the pages
(which is the case with executables - the system mmaps the binary into
your memory space as read-only), and assuming that those pages are
backed by physical memory, which process is using that memory?

>  > Tell me, which process is responsible for libc being in memory?
>  > Other than, like, all of them?
>
> Yes.  Why would you want a different answer?

Because that would mean that I have way more *physical* memory in use
than I actually have chips on the motherboard for. Yes, virtual memory
can be over-allocated, but physical memory can't. How do you measure
where your physical memory is being used, if adding up the usage of
all processes exceeds the available resources? It's like trying to
figure out which countries the people of this planet live in, and
saying that there are X billion in China, Y million in Australia, etc,
etc, etc, and then discovering that you're counting people multiple
times if they ever move... which means that there are ten or twenty
billion people on the planet.

ChrisA
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Support parsing stream with `re`

2018-10-09 Thread Stephen J. Turnbull
Chris Angelico writes:

 > Both processes are using the virtual memory. Either or both could be
 > using physical memory. Assuming they haven't written to the pages
 > (which is the case with executables - the system mmaps the binary into
 > your memory space as read-only), and assuming that those pages are
 > backed by physical memory, which process is using that memory?

One doesn't know.  Clever side-channel attacks aside, I don't care,
and I don't see how it could matter.

 > >  > Tell me, which process is responsible for libc being in memory?
 > >  > Other than, like, all of them?
 > >
 > > Yes.  Why would you want a different answer?
 > 
 > Because that would mean that I have way more *physical* memory in use
 > than I actually have chips on the motherboard for.

No, that's like saying that because you have multiple links to a file
on disk you're using more physical disk than you have.

 > Yes, virtual memory can be over-allocated, but physical memory
 > can't. How do you measure where your physical memory is being used,
 > if adding up the usage of all processes exceeds the available
 > resources?

lsof can do it, I guess.  The kernel knows which pages of which
processes' virtual memory are backed by physical memory.  But just as
running over the array of inodes tells you how much physical disk is
allocated, without looking at directories to find out what it's
allocated to, the kernel probably has a bitmap or similar so it knows
which pages of physical memory are in use.

All problems are easy once you have enough indirection. :^)
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Support parsing stream with `re`

2018-10-09 Thread Chris Angelico
On Wed, Oct 10, 2018 at 5:09 AM Stephen J. Turnbull
 wrote:
>
> Chris Angelico writes:
>
>  > Both processes are using the virtual memory. Either or both could be
>  > using physical memory. Assuming they haven't written to the pages
>  > (which is the case with executables - the system mmaps the binary into
>  > your memory space as read-only), and assuming that those pages are
>  > backed by physical memory, which process is using that memory?
>
> One doesn't know.  Clever side-channel attacks aside, I don't care,
> and I don't see how it could matter.

It matters a lot when you're trying to figure out what your system is doing.

>  > >  > Tell me, which process is responsible for libc being in memory?
>  > >  > Other than, like, all of them?
>  > >
>  > > Yes.  Why would you want a different answer?
>  >
>  > Because that would mean that I have way more *physical* memory in use
>  > than I actually have chips on the motherboard for.
>
> No, that's like saying that because you have multiple links to a file
> on disk you're using more physical disk than you have.

Actually, that's exactly the same problem, with exactly the same
consequences. How do you figure out why your disk is full? How do you
enforce disk quotas? How can you get any sort of reasonable metrics
about anything when the sum of everything vastly exceeds the actual
usage?

ChrisA
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Support parsing stream with `re`

2018-10-09 Thread Greg Ewing

Stephen J. Turnbull wrote:


Subject to COW, I presume.  Probably in units smaller than the whole
file (per page?)


It can be COW or not, depending on the options passed to mmap.
And yes, it's mapped in units of pages.

--
Greg
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Support parsing stream with `re`

2018-10-09 Thread Cameron Simpson

On 10Oct2018 00:42, Stephen J. Turnbull  
wrote:

Chris Angelico writes:
> On Tue, Oct 9, 2018 at 10:05 PM Greg Ewing  
wrote:
> > Chris Angelico wrote:
> > > In contrast, a mmap'd file is memory that you do indeed own.
> >
> > Although it's not really accurate to say that it's owned by
> > a particular process. If two processes mmap the same file,
> > the physical memory pages holding it appear in the address
> > spaces of both processes.

Subject to COW, I presume.  Probably in units smaller than the whole
file (per page?)


Yes, pages (whatever unit the memory paging service works in).

But not COW. Changes to the memory affect the file contents and are 
visible to the other process. It is mapped, not pseudocopied.


Cheers,
Cameron Simpson 
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Support parsing stream with `re`

2018-10-09 Thread Stephen J. Turnbull
Chris Angelico writes:
 > On Wed, Oct 10, 2018 at 5:09 AM Stephen J. Turnbull
 >  wrote:
 > >
 > > Chris Angelico writes:
 > >
 > >  > Both processes are using the virtual memory. Either or both could be
 > >  > using physical memory. Assuming they haven't written to the pages
 > >  > (which is the case with executables - the system mmaps the binary into
 > >  > your memory space as read-only), and assuming that those pages are
 > >  > backed by physical memory, which process is using that memory?
 > >
 > > One doesn't know.  Clever side-channel attacks aside, I don't care,
 > > and I don't see how it could matter.
 > 
 > It matters a lot when you're trying to figure out what your system
 > is doing.

Sure, but knowing how your system works is far more important.  Eg,
create a 1TB file on a POSIX system, delete it while a process still
has it opened, and it doesn't matter how you process the output of du
or ls, you still have 1TB of used file space not accounted for.  The
same applies to swapfiles.  But "df" knows and will tell you.

In fact, "ps" will tell you how much shared memory a process is using.

I just don't see a problem here, on the "I'm not getting the data I
need" side.  You do have access to the data you need.

 > >  > >  > Tell me, which process is responsible for libc being in memory?
 > >  > >  > Other than, like, all of them?
 > >  > >
 > >  > > Yes.  Why would you want a different answer?
 > >  >
 > >  > Because that would mean that I have way more *physical* memory in use
 > >  > than I actually have chips on the motherboard for.
 > >
 > > No, that's like saying that because you have multiple links to a file
 > > on disk you're using more physical disk than you have.
 > 
 > Actually, that's exactly the same problem, with exactly the same
 > consequences. How do you figure out why your disk is full? How do you
 > enforce disk quotas? How can you get any sort of reasonable metrics
 > about anything when the sum of everything vastly exceeds the actual
 > usage?

You add up the right things, of course, and avoid paradoxes.

The disk quota enforcement problem is indeed hard.  This sounds to me
like a special problem studied in cost accounting, a problem which was
solved (a computation that satisfies certain axioms was shown to exist
and be unique) in a sense by Aumann and Shapley in the 1970s.  The A-S
prices have been used by telephone carriers to allocate costs of fixed
assets with capacity constraints to individual calls, though I don't
know if the method is still in use.  I'm not sure if the disk quota
problem fits the A-S theorem (which imposes certain monotonicity
conditions), but the general paradigm does.

However, the quota problem (and in general, the problem of allocation
of overhead) is "hard" even if you have complete information about the
system, because it's a values problem: what events are bad? what
events are worse? what events are unacceptable (result in bankruptcy
and abandonment of the system)?  Getting very complete, accurate
information about the physical consequences of individual events in
the system (linking to a file on disk, allocating a large quantity of
virtual memory) is not difficult, in the sense that you throw money
and engineers at it, and you get "df".  Getting very complete,
accurate information about the values you're trying to satisfy is
possible only for an omniscient god, even if, as in business, they can
be measured in currency units.

Steve
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Support parsing stream with `re`

2018-10-12 Thread James Lu
The file system is really just a b-tree. If you’re concerned about using
memory, you can implement an O(log n) map using the file system, where the
entries are the different critical sections. Every node is a folder and every
file is a leaf. Many package managers implement maps like this.

I’d like to propose a second alternative. You can use a script that’s
responsible for creating the file of commands. The file is actually a FIFO
pipe. The script streams data to the file, but to save resources it stops
when the file/buffer reaches a certain size. The command-running code works
in a different process. Python has good capability for handling streams.
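
For illustration, a minimal sketch of the folders-as-nodes map described
above (all names here are invented for the example; error handling omitted):

import pathlib

class FSMap:
    # A toy map backed by the file system: keys become nested directories
    # (two characters per level) and each value is stored in a leaf file.
    def __init__(self, root):
        self.root = pathlib.Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _leaf(self, key):
        parts = [key[i:i + 2] for i in range(0, len(key), 2)] or ['_']
        return self.root.joinpath(*parts)

    def __setitem__(self, key, value):
        leaf = self._leaf(key)
        leaf.parent.mkdir(parents=True, exist_ok=True)
        leaf.write_text(value)

    def __getitem__(self, key):
        return self._leaf(key).read_text()

store = FSMap('fsmap-demo')
store['section-a'] = 'first critical section'
print(store['section-a'])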
> On Oct 10, 2018, at 1:27 AM, Stephen J. Turnbull wrote:
> [full quote of the preceding message snipped]
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/