Re: Mutating an HTML file with BeautifulSoup

2022-08-22 Thread Jon Ribbens via Python-list
On 2022-08-22, Peter J. Holzer  wrote:
> On 2022-08-22 00:45:56 -, Jon Ribbens via Python-list wrote:
>> With the offset though, BeautifulSoup made an arbitrary decision to
>> use ISO-8859-1 encoding and so when you chopped the bytestring at
>> that offset it only worked because BeautifulSoup had happened to
>> choose a 1-byte-per-character encoding. Ironically, *without* the
>> "\xed\xa0\x80\xed\xbc\x9f" it wouldn't have worked.
>
> Actually it would. The unit is bytes if you feed it with bytes, and
> characters if you feed it with str.

No it isn't. If you give BeautifulSoup's 'html.parser' bytes as input,
it first chooses an encoding and decodes the bytes before sending that
output to html.parser, which is what provides the offset. So the offsets
it gives are in characters, and you've no simple way of converting that
back to byte offsets.
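A minimal sketch of that mismatch (the tag and offsets here are made up for illustration, not taken from the thread): once a multi-byte UTF-8 sequence precedes the tag, the character offset no longer equals the byte offset.

```python
# 'ü' is one character but two bytes in UTF-8, so every offset
# after it differs between the decoded str and the raw bytes.
raw = '<p>\u00fc</p><a href="x">'.encode('utf-8')
text = raw.decode('utf-8')

char_pos = text.index('<a')  # offset in characters: 8
byte_pos = raw.index(b'<a')  # offset in bytes: 9
```

Slicing the raw bytes at the character offset would land one byte too early, in the middle of the preceding content.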

> (OTOH it seems that the html parser doesn't heed any <meta charset=...>
> tags, which seems less than ideal for more pedestrian purposes.)

html.parser doesn't accept bytes as input, so it couldn't do anything
with the encoding even if it knew it. BeautifulSoup's 'html.parser'
however does look for and use <meta charset=...> tags (using a regexp,
natch).

>> It looks like BeautifulSoup is doing something like that, yes.
>> Personally I would be nervous about some of my files being parsed
>> as UTF-8 and some of them ISO-8859-1 (due to decoding errors rather
>> than some of the files actually *being* ISO-8859-1 ;-) )
>
> Since none of the syntactically meaningful characters have a code >=
> 0x80, you can parse HTML at the byte level if you know that it's encoded
> in a strict superset of ASCII (which all of the ISO-8859 family and
> UTF-8 are). Only if that's not true (e.g. if your files might be UTF-16
> or Shift-JIS or EUC, if I remember correctly) do you have to know
> the character set.
>
> (By parsing I mean only "create a syntax tree". Obviously you have to
> know the encoding to know whether to display «c3 bc» as «ü» or «Ã¼».)

But the job here isn't to create a syntax tree. It's to change some of
the content, which for all we know is not ASCII.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-22 Thread Peter J. Holzer
On 2022-08-22 00:45:56 -, Jon Ribbens via Python-list wrote:
> With the offset though, BeautifulSoup made an arbitrary decision to
> use ISO-8859-1 encoding and so when you chopped the bytestring at
> that offset it only worked because BeautifulSoup had happened to
> choose a 1-byte-per-character encoding. Ironically, *without* the
> "\xed\xa0\x80\xed\xbc\x9f" it wouldn't have worked.

Actually it would. The unit is bytes if you feed it with bytes, and
characters if you feed it with str. So in any case you can use the
offset on the data you fed to the parser. Maybe not what you expected,
but seems quite useful for what Chris has in mind.

(OTOH it seems that the html parser doesn't heed any <meta charset=...>
tags, which seems less than ideal for more pedestrian purposes.)

> > So I would probably just let this one go through as 8859-1.
> 
> It looks like BeautifulSoup is doing something like that, yes.
> Personally I would be nervous about some of my files being parsed
> as UTF-8 and some of them ISO-8859-1 (due to decoding errors rather
> than some of the files actually *being* ISO-8859-1 ;-) )

Since none of the syntactically meaningful characters have a code >=
0x80, you can parse HTML at the byte level if you know that it's encoded
in a strict superset of ASCII (which all of the ISO-8859 family and
UTF-8 are). Only if that's not true (e.g. if your files might be UTF-16
or Shift-JIS or EUC, if I remember correctly) do you have to know
the character set.
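A rough illustration of that point (the snippet and tag names are invented for the demo): because '<' and '>' are ASCII, and no byte of a non-ASCII character in UTF-8 or ISO-8859-1 falls in the ASCII range in a way that collides with them, a byte-level scan finds the same tags whichever ASCII-superset encoding produced the bytes.

```python
import re

# The text contains a non-ASCII character ('ü'); the tag
# delimiters themselves are plain ASCII in both encodings.
results = {}
for encoding in ('utf-8', 'iso-8859-1'):
    raw = '<p>h\u00fcllo</p>'.encode(encoding)
    results[encoding] = re.findall(rb'<[^>]*>', raw)
# Both encodings yield the same tag list: [b'<p>', b'</p>']
```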

(By parsing I mean only "create a syntax tree". Obviously you have to
know the encoding to know whether to display «c3 bc» as «ü» or «Ã¼».)
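Concretely, the «c3 bc» example above plays out like this:

```python
data = b'\xc3\xbc'                  # the two bytes «c3 bc»
as_utf8 = data.decode('utf-8')      # 'ü'  -- one character
as_latin1 = data.decode('latin-1')  # 'Ã¼' -- two characters
```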

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: Mutating an HTML file with BeautifulSoup

2022-08-22 Thread Peter J. Holzer
On 2022-08-22 00:09:01 -, Jon Ribbens via Python-list wrote:
> On 2022-08-21, Peter J. Holzer  wrote:
> > On 2022-08-20 21:51:41 -, Jon Ribbens via Python-list wrote:
> >>   result = re.sub(
> >>   r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""",
> >
> > This will fail on:
> > 
> 
> I've seen *a lot* of bad/broken/weird HTML over the years, and I don't
> believe I've ever seen anyone do that. (Wrongly putting an 'alt'
> attribute on an 'a' element is very common, on the other hand ;-) )

My bad. I meant title, not alt, of course. The unescaped > is completely
standard-conforming HTML, however (both HTML 4.01 Strict and HTML 5).
You almost never have to escape > - in fact I can't think of any case
right now - and I generally don't (sometimes I do for symmetry with <,
but that's an aesthetic choice, not a technical one).
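A quick check with the stdlib parser bears this out (the tag and attribute values are made up for the demo): an unescaped '>' inside a quoted attribute value does not terminate the tag.

```python
from html.parser import HTMLParser

class Collector(HTMLParser):
    """Record each start tag with its attribute list."""
    def __init__(self):
        super().__init__()
        self.tags = []
    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, attrs))

p = Collector()
p.feed('<a title="1 > 0" href="/x">a > b</a>')
# The '>' inside the quoted title is kept as part of the value:
# p.tags == [('a', [('title', '1 > 0'), ('href', '/x')])]
```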


> > The problem can be solved with regular expressions (and given the
> > constraints I think I would prefer that to using Beautiful Soup), but
> > getting the regexps right is not trivial, at least in the general case.
> 
> I would like to see the regular expression that could fully parse
> general HTML...

That depends on what you mean by "parse".

If you mean "construct a DOM tree", you can't since regular expressions
(in the mathematical sense, not what's implemented by some programming
languages) by definition describe finite automata, and those don't
support recursion.

But if you mean "split into a sequence of tags and PCDATA's (and then
each tag further into its attributes)", that's absolutely possible, and
that's all that is needed here. I don't think I have ever implemented a
complete solution (if only because stuff like  is
extremely rare), but I should have some Perl code lying around which
worked on a wide variety of HTML. I just have to find it again ...
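In that spirit, here is a rough (and knowingly incomplete) sketch of such a splitter as a single Python regex. It is not Peter's Perl code, just an illustration of the idea: the tag pattern consumes quoted attribute values first, which is exactly the case a naive `<[^>]*>` gets wrong.

```python
import re

# One alternative matches a tag (allowing '>' inside quoted
# attribute values); the other matches a run of PCDATA.
TOKEN = re.compile(r'''
    <[^>'"]*(?:"[^"]*"[^>'"]*|'[^']*'[^>'"]*)*>   # a tag
    | [^<]+                                       # PCDATA
''', re.VERBOSE)

html = '<a title="1 > 0" href="/x">link</a>'
tokens = TOKEN.findall(html)
# ['<a title="1 > 0" href="/x">', 'link', '</a>']
```

Comments, CDATA sections, and unquoted attribute values would all need extra alternatives before this could be called complete.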

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: Mutating an HTML file with BeautifulSoup

2022-08-22 Thread Jon Ribbens via Python-list
On 2022-08-21, Chris Angelico  wrote:
> On Mon, 22 Aug 2022 at 05:43, Jon Ribbens via Python-list
> wrote:
>> On 2022-08-21, Chris Angelico  wrote:
>> > On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list
>> > wrote:
>> >> On 2022-08-20, Chris Angelico  wrote:
>> >> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram  
>> >> > wrote:
>> >> >> 2qdxy4rzwzuui...@potatochowder.com writes:
>> >> >> >textual representations.  That way, the following two elements are the
>> >> >> >same (and similar with a collection of sub-elements in a different 
>> >> >> >order
>> >> >> >in another document):
>> >> >>
>> >> >>   The /elements/ differ. They have the /same/ infoset.
>> >> >
>> >> > That's the bit that's hard to prove.
>> >> >
>> >> >>   The OP could edit the files with regexps to create a new version.
>> >> >
>> >> > To you and Jon, who also suggested this: how would that be beneficial?
>> >> > With Beautiful Soup, I have the line number and position within the
>> >> > line where the tag starts; what does a regex give me that I don't have
>> >> > that way?
>> >>
>> >> You mean you could use BeautifulSoup to read the file and identify the
>> >> bits you want to change by line number and offset, and then you could
>> >> use that data to try and update the file, hoping like hell that your
>> >> definition of "line" and "offset" are identical to BeautifulSoup's
>> >> and that you don't mess up later changes when you do earlier ones (you
>> >> could do them in reverse order of line and offset I suppose) and
>> >> probably resorting to regexps anyway in order to find the part of the
>> >> tag you want to change ...
>> >>
>> >> ... or you could avoid all that faff and just do re.sub()?
>> >
>> > Stefan answered in part, but I'll add that it is far FAR easier to do
>> > the analysis with BS4 than regular expressions. I'm not sure what
>> > "hoping like hell" is supposed to mean here, since the line and offset
>> > have been 100% accurate in my experience;
>>
>> Given the string:
>>
>> b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8?"
>>
>> what is the line number and offset of the question mark - and does
>> BeautifulSoup agree with your answer? Does the answer to that second
>> question change depending on what parser you tell BeautifulSoup to use?
>
> I'm not sure, because I don't know how to ask BS4 about the location
> of a question mark. But I replaced that with a tag, and:
>
> >>> raw = b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8<body>"
> >>> from bs4 import BeautifulSoup
> >>> soup = BeautifulSoup(raw, "html.parser")
> >>> soup.body.sourceline
> 4
> >>> soup.body.sourcepos
> 12
> >>> raw.split(b"\n")[3]
> b'\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8<body>'
> >>> raw.split(b"\n")[3][12:]
> b'<body>'
>
> So, yes, it seems to be correct. (Slightly odd in that the sourceline
> is 1-based but the sourcepos is 0-based, but that is indeed the case,
> as confirmed with a much more straightforward string.)
>
> And yes, it depends on the parser, but I'm using html.parser and it's fine.

Hah, yes, it appears html.parser does an end-run around my lovely,
carefully crafted hard case by not even *trying* to work out what
type of line endings the file uses, and is just hard-coded to only
recognise "\n" as a line ending.
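This is easy to see with the stdlib parser directly (the input string here is made up): a bare '\r' does not advance the line counter, only '\n' does.

```python
from html.parser import HTMLParser

class PosCollector(HTMLParser):
    """Record the (line, offset) position of each start tag."""
    def __init__(self):
        super().__init__()
        self.positions = []
    def handle_starttag(self, tag, attrs):
        self.positions.append(self.getpos())

p = PosCollector()
p.feed('line1\rstill line1\nline2<p>')
# The '\r' is ignored for line counting: the <p> is reported at
# line 2 (1-based), offset 5 (0-based), i.e. right after "line2".
```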

With the offset though, BeautifulSoup made an arbitrary decision to
use ISO-8859-1 encoding and so when you chopped the bytestring at
that offset it only worked because BeautifulSoup had happened to
choose a 1-byte-per-character encoding. Ironically, *without* the
"\xed\xa0\x80\xed\xbc\x9f" it wouldn't have worked.

>> (If your answer is "if the input contains \xed\xa0\x80\xed\xbc\x9f then
>> I am happy with the program throwing an exception" then feel free to
>> remove that substring from the question.)
>
> Malformed UTF-8 doesn't seem to be a problem. Every file here seems to
> be either UTF-8 or ISO-8859, and in the latter case, I'm assuming
> 8859-1. So I would probably just let this one go through as 8859-1.

It looks like BeautifulSoup is doing something like that, yes.
Personally I would be nervous about some of my files being parsed
as UTF-8 and some of them ISO-8859-1 (due to decoding errors rather
than some of the files actually *being* ISO-8859-1 ;-) )

>> > the only part I'm unsure about is where the _end_ of the tag is (and
>> > maybe there's a way I can use BS4 again to get that??).
>>
>> There doesn't seem to be. More to the point, there doesn't seem to be
>> a way to find out where the *attributes* are, so as I said you'll most
>> likely end up using regexps anyway.
>
> I'm okay with replacing an entire tag that needs to be changed.

Oh, that seems like quite a big change to the original problem.

> Especially if I can replace just the opening tag, not the contents and
> closing tag. And in fact, I may just do that part by scanning for an
> unencoded greater-than, on the assumptions that (a) BS4 will correctly
> encode any greater-thans in attributes,

But your input wasn't created by 

Re: Python scripts in .exe form

2022-08-22 Thread Mona Lee
I didn't create the exe files; they kind of just appeared, I guess? Perhaps
somewhere in the process of redownloading my Python/Visual Studio?

My situation is similar to this person's description that I found online
https://stackoverflow.com/questions/62315149/why-are-my-python-packages-being-installed-to-this-strange-folder
 

On Saturday, August 20, 2022 at 7:25:31 AM UTC-6, Jim Schwartz wrote:
> What method did you use to create the exe file from your python scripts? If 
> it was pyinstaller, then it puts the compiled versions of these python 
> scripts in a windows temp folder when you run them. You’ll be able to get the 
> scripts from there. 
> 
> Sent from my iPhone 
> 
> > On Aug 19, 2022, at 9:51 PM, Mona Lee wrote:
> > 
> > I'm pretty new to Python, and I had to do some tinkering because I was 
> > running into issues with trying to download a package from PIP and must've 
> > caused some issues in my program that I don't know how to fix
> > 
> > 1. It started when I was unable to update PIP to the newest version because 
> > of some "Unknown error" (VS Code error - unable to read file - 
> > (Unknown(FileSystemError) where I believe some file was not saved in the 
> > right location? 
> > 
> > 2. In my command line on VS code there used to be the prefix that looked 
> > something like "PS C:\Users\[name]>" but now it is "PS 
> > C:\Users\[name]\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\Scripts>
> >  
> > 
> > From there I redownloaded my VS code but still have the 2) issue. 
> > 
> > also, my scripts are now in the .exe form that I cannot access because "it 
> > is either binary or in a unsupported text encoding" I've tried to extract 
> > it back into the .py form using pyinstxtractor and decompile-python3 but I 
> > can't successfully work these. 
> > 
> > 3. also wanted to mention that some of my old Python programs are missing.
> > -- 
> > https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-22 Thread Jon Ribbens via Python-list
On 2022-08-21, Peter J. Holzer  wrote:
> On 2022-08-20 21:51:41 -, Jon Ribbens via Python-list wrote:
>> On 2022-08-20, Stefan Ram  wrote:
>> > Jon Ribbens  writes:
>> >>... or you could avoid all that faff and just do re.sub()?
>
>> > source = ''
>> >
>> > # Use Python to change the source, keeping the order of attributes.
>> >
>> > result = re.sub( r'href\s*=\s*"http"', r'href="https"', source )
>> > result = re.sub( r"href\s*=\s*'http'", r"href='https'", result )
>
> Depending on the content of the site, this might replace some stuff
> which is not a link.
>
>> You could go a bit harder with the regexp of course, e.g.:
>> 
>>   result = re.sub(
>>   r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""",
>
> This will fail on:
> 

I've seen *a lot* of bad/broken/weird HTML over the years, and I don't
believe I've ever seen anyone do that. (Wrongly putting an 'alt'
attribute on an 'a' element is very common, on the other hand ;-) )

> The problem can be solved with regular expressions (and given the
> constraints I think I would prefer that to using Beautiful Soup), but
> getting the regexps right is not trivial, at least in the general case.

I would like to see the regular expression that could fully parse
general HTML...


Re: subprocess.popen how wait complete open process

2022-08-22 Thread Eryk Sun
On 8/21/22, simone zambonardi  wrote:
> Hi, I am running a program with the function subprocess.Popen(...). What I
> should do is stop the script until the launched program is fully open.
> How can I do this? I used a time.sleep() function, but I think there are
> other ways. Thanks

In Windows, WaitForInputIdle() waits until a thread in a process
creates one or more windows and its message loop goes idle. Usually
this is the main UI thread. Console processes are not supported.

For example:

import ctypes
import subprocess

user32 = ctypes.WinDLL('user32', use_last_error=True)

INFINITE = 0xFFFF_FFFF
WAIT_FAILED = 0xFFFF_FFFF
WAIT_TIMEOUT = 0x0000_0102

# Waiting on a console process fails with ERROR_NOT_GUI_PROCESS.
# This case can be handled in other ways, depending on the need.
ERROR_NOT_GUI_PROCESS = 1471

user32.WaitForInputIdle.restype = ctypes.c_ulong
user32.WaitForInputIdle.argtypes = (ctypes.c_void_p, ctypes.c_ulong)

def wait_for_input_idle(proc, timeout=None):
    if isinstance(proc, subprocess.Popen):
        handle = int(proc._handle)
        args = proc.args
    else:
        handle = int(proc)
        args = ''
    if timeout is None:
        timeout_ms = INFINITE
    elif timeout < 0:
        raise ValueError('timeout cannot be negative')
    else:
        timeout_ms = int(timeout * 1000)
        if timeout_ms >= INFINITE:
            raise OverflowError('timeout is too large')
    status = user32.WaitForInputIdle(handle, timeout_ms)
    if status == WAIT_FAILED:
        raise ctypes.WinError(ctypes.get_last_error())
    elif status == WAIT_TIMEOUT:
        raise subprocess.TimeoutExpired(args, timeout)
    assert status == 0
    return


if __name__ == '__main__':
    import time
    t0 = time.time()
    p = subprocess.Popen(['pythonw.exe', '-m', 'idlelib'])

    try:
        wait_for_input_idle(p, 5)
    except:
        p.terminate()
        raise

    wait_time = time.time() - t0
    print(f'wait time: {wait_time:.3f} seconds')
    try:
        p.wait(5)
    except subprocess.TimeoutExpired:
        p.terminate()


[Python-announce] PyConZA 2022 - Second Call for Submissions

2022-08-22 Thread Neil Muller
This is a second call for submissions to PyConZA 2022.

PyConZA 2022 will take place on the 13th & 14th of October, 2022. This
year, PyConZA will be a hybrid conference (with in-person and online
access) hosted at the Premier Splendid Inn in Umhlanga, Durban.

To accommodate speakers who are unable to travel to Durban, we will be
accepting a small number of talks to be given remotely.

We are looking for the following presentations:
  - Keynotes (45-minute talks on a subject of general interest)
  - Talks (30-minute talks on more specific topics)
  - Remote talks (30-minute talks to be delivered remotely - note
that the number of remote submissions we can accommodate is limited).

We are accepting submissions for tutorials, which will run on the 12th
of October. Tutorials can either be half-day (4 hours) or full-day (8
hours).

If you would like to give a presentation, please register at
https://za.pycon.org/ and submit your proposal, following the
instructions at https://za.pycon.org/talks/submit-talk/ . We have a
number of tracks available, including: Data Science, Teaching and
Learning with Python, Web, Scientific Computing, Testing and Other
(which includes all talks that don't fall under the mentioned tracks).
We hope to notify accepted presenters by no later than the 14th of
September 2022.

Speakers will be expected to be available after the presentation for a
short Q&A session. Shared sessions are also possible. The
presentations will be in English.

PyConZA offers a mentorship program for inexperienced speakers. If you
would like assistance preparing your submission, email
t...@za.pycon.org with a rough draft of your talk proposal and we'll
find a suitable experienced speaker to act as a mentor.

If you want to present something that doesn't fit into the standard
talk categories at PyConZA, please contact the organising committee at
t...@za.pycon.org so we can discuss whether that will be feasible.

--
Neil Muller
On behalf of the PyConZA organising committee
___
Python-announce-list mailing list -- python-announce-list@python.org
To unsubscribe send an email to python-announce-list-le...@python.org
https://mail.python.org/mailman3/lists/python-announce-list.python.org/
Member address: arch...@mail-archive.com


Re: Mutating an HTML file with BeautifulSoup

2022-08-22 Thread Peter Otten

On 22/08/2022 05:30, Chris Angelico wrote:
> On Mon, 22 Aug 2022 at 10:04, Buck Evan  wrote:
>> I've had much success doing round trips through the lxml.html parser.
>>
>> https://lxml.de/lxmlhtml.html
>>
>> I ditched bs for lxml long ago and never regretted it.
>>
>> If you find that you have a bunch of invalid html that lxml inadvertently
>> "fixes", I would recommend adding a stutter-step to your project: perform a
>> noop roundtrip thru lxml on all files. I'd then analyze any diff by
>> progressively excluding changes via `grep -vP`.
>> Unless I'm mistaken, all such changes should fall into no more than a
>> dozen groups.
>
> Will this round-trip mutate every single file and reorder the tag
> attributes? Because I really don't want to manually eyeball all those
> changes.


Most certainly not. Reordering is a bs4 feature that is governed by a
formatter. You can easily prevent attributes from being reordered:

>>> import bs4
>>> soup = bs4.BeautifulSoup("")
>>> soup

>>> class Formatter(bs4.formatter.HTMLFormatter):
...     def attributes(self, tag):
...         return [] if tag.attrs is None else list(tag.attrs.items())

>>> soup.decode(formatter=Formatter())
''

Blank space is probably removed by the underlying html parser.
It might be possible to make bs4 instantiate the lxml.html.HTMLParser
with remove_blank_text=False, but I didn't try hard enough ;)

That said, for my humble html scraping needs I have ditched bs4 in favor
of lxml and its xpath capabilities.

