Re: subprocess.popen how wait complete open process
On Mon, 22 Aug 2022 at 13:41, Dan Stromberg wrote:
> On Sun, Aug 21, 2022 at 2:05 PM Chris Angelico wrote:
> > On Mon, 22 Aug 2022 at 05:39, simone zambonardi wrote:
> > > Hi, I am running a program with the punishment subprocess.Popen(...) what I
> > > should do is to stop the script until the launched program is fully open.
> > > How can I do this? I used a time.sleep() function but I think there are
> > > other ways. Thanks
> >
> > First you have to define "fully open". How would you know?
>
> If you're on X11, you could conceivably use:
> xwininfo -tree -root

That's only one possible definition: it has some sort of window. But to
wait until a program is "fully open", you might have to wait past a
splash screen until it has its actual application window. Or maybe even
then, it's not ready for operation.

Only the OP can know what defines "fully open".

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
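Since the definition of "fully open" has to come from the caller, one way to replace a fixed time.sleep() is a small polling loop that takes the readiness check as a function. This is only a sketch: wait_until and port_is_open are names made up for illustration, and the TCP check is just one possible definition of "ready" (it assumes the launched program listens on a port, which nothing in this thread says).

```python
import socket
import time


def wait_until(is_ready, timeout=30.0, interval=0.25):
    """Call is_ready() repeatedly until it returns true or timeout expires.

    Returns True if the check succeeded, False if we gave up.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def port_is_open(host, port):
    """Example readiness check (an assumption): can we connect to a TCP port?"""
    try:
        with socket.create_connection((host, port), timeout=1.0):
            return True
    except OSError:
        return False


# Hypothetical usage, after subprocess.Popen([...]) has been called:
#     ok = wait_until(lambda: port_is_open("127.0.0.1", 8000), timeout=20)
```

Whatever the real condition is (a window, a file, a port), only the is_ready function changes; the loop and the timeout stay the same.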
Re: subprocess.popen how wait complete open process
On Sun, Aug 21, 2022 at 2:05 PM Chris Angelico wrote:
> On Mon, 22 Aug 2022 at 05:39, simone zambonardi wrote:
> > Hi, I am running a program with the punishment subprocess.Popen(...) what
> > I should do is to stop the script until the launched program is fully open.
> > How can I do this? I used a time.sleep() function but I think there are
> > other ways. Thanks
>
> First you have to define "fully open". How would you know?

If you're on X11, you could conceivably use:
xwininfo -tree -root
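A rough sketch of how the xwininfo idea might be driven from Python: run `xwininfo -tree -root` in a loop and look for a window whose listing mentions a given piece of text. The function names and the title-matching rule are assumptions for illustration, and as noted elsewhere in the thread, "a window exists" is only one possible meaning of "fully open".

```python
import shutil
import subprocess
import time


def window_listed(title_fragment, tree_text):
    """True if any line of `xwininfo -tree -root` output mentions the text."""
    return any(title_fragment in line for line in tree_text.splitlines())


def wait_for_window(title_fragment, timeout=30.0, interval=0.5):
    """Poll xwininfo until a matching window appears; False on timeout."""
    if shutil.which("xwininfo") is None:
        raise RuntimeError("xwininfo not found; this sketch only applies to X11")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = subprocess.run(
            ["xwininfo", "-tree", "-root"],
            capture_output=True,
            text=True,
            check=False,
        )
        if result.returncode == 0 and window_listed(title_fragment, result.stdout):
            return True
        time.sleep(interval)
    return False


# Hypothetical usage ("myapp" and "My App" are placeholders):
#     proc = subprocess.Popen(["myapp"])
#     if not wait_for_window("My App", timeout=20):
#         print("window never showed up")
```

A plain substring match can be fooled by a splash screen or by another window with a similar title, so the text to look for needs to be chosen with the real application in mind.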
Re: Mutating an HTML file with BeautifulSoup
On Mon, 22 Aug 2022 at 10:04, Buck Evan wrote:
> I've had much success doing round trips through the lxml.html parser.
>
> https://lxml.de/lxmlhtml.html
>
> I ditched bs for lxml long ago and never regretted it.
>
> If you find that you have a bunch of invalid html that lxml inadvertently
> "fixes", I would recommend adding a stutter-step to your project: perform a
> noop roundtrip thru lxml on all files. I'd then analyze any diff by
> progressively excluding changes via `grep -vP`.
> Unless I'm mistaken, all such changes should fall into no more than a dozen
> groups.

Will this round-trip mutate every single file and reorder the tag
attributes? Because I really don't want to manually eyeball all those
changes.

ChrisA
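One way to answer that question without eyeballing anything is to measure it: run the no-op round trip, collect the changed lines, and filter out the classes of change already understood, much like repeated `grep -vP`. The sketch below is parser-agnostic; roundtrip_diff and exclude are made-up names, and the round-trip function is passed in by the caller so nothing here depends on how lxml (or any other parser) actually serializes.

```python
import difflib
import itertools
import re


def roundtrip_diff(text, roundtrip):
    """Return the '+'/'-' lines of a zero-context diff of text vs roundtrip(text)."""
    before = text.splitlines()
    after = roundtrip(text).splitlines()
    diff = difflib.unified_diff(before, after, lineterm="", n=0)
    body = itertools.islice(diff, 2, None)  # skip the '---' / '+++' header lines
    return [line for line in body if line.startswith(("+", "-"))]


def exclude(diff_lines, *patterns):
    """Drop diff lines matching any of the regexps, like chained `grep -vP`."""
    compiled = [re.compile(p) for p in patterns]
    return [line for line in diff_lines if not any(c.search(line) for c in compiled)]


# Hypothetical usage with lxml (third-party, not imported here):
#     import lxml.html
#     def via_lxml(text):
#         return lxml.html.tostring(lxml.html.fromstring(text), encoding="unicode")
#     remaining = exclude(roundtrip_diff(page, via_lxml), r"known-harmless-pattern")
```

An empty result from roundtrip_diff means the file survived the round trip untouched; whatever is left after exclude is the part that still needs a human.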
Re: Mutating an HTML file with BeautifulSoup
I've had much success doing round trips through the lxml.html parser.

https://lxml.de/lxmlhtml.html

I ditched bs for lxml long ago and never regretted it.

If you find that you have a bunch of invalid html that lxml inadvertently
"fixes", I would recommend adding a stutter-step to your project: perform a
noop roundtrip thru lxml on all files. I'd then analyze any diff by
progressively excluding changes via `grep -vP`.
Unless I'm mistaken, all such changes should fall into no more than a dozen
groups.

On Fri, Aug 19, 2022, 1:34 PM Chris Angelico wrote:
> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
>
> Using the Alice example from the BS4 docs:
>
> >>> html_doc = """<html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
> <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
> <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
>
> <p class="story">...</p>
> """
> >>> print(soup)
> <html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
> <p class="story">...</p>
>
> >>>
>
> Note two distinct changes: firstly, whitespace has been removed, and
> secondly, attributes are reordered (I think alphabetically). There are
> other canonicalizations being done, too.
>
> I'm trying to make some automated changes to a huge number of HTML
> files, with minimal diffs so they're easy to validate. That means that
> spurious changes like these are very much unwanted. Is there a way to
> get BS4 to reconstruct the original precisely?
>
> The mutation itself would be things like finding an anchor tag and
> changing its href attribute. Fairly simple changes, but might alter
> the length of the file (eg changing "http://example.com/" into
> "https://example.com/"). I'd like to do them intelligently rather than
> falling back on element.sourceline and element.sourcepos, but worst
> case, that's what I'll have to do (which would be fiddly).
>
> ChrisA
Re: Mutating an HTML file with BeautifulSoup
On Mon, 22 Aug 2022 at 05:43, Jon Ribbens via Python-list wrote: > > On 2022-08-21, Chris Angelico wrote: > > On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list > > wrote: > >> On 2022-08-20, Chris Angelico wrote: > >> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram wrote: > >> >> 2qdxy4rzwzuui...@potatochowder.com writes: > >> >> >textual representations. That way, the following two elements are the > >> >> >same (and similar with a collection of sub-elements in a different > >> >> >order > >> >> >in another document): > >> >> > >> >> The /elements/ differ. They have the /same/ infoset. > >> > > >> > That's the bit that's hard to prove. > >> > > >> >> The OP could edit the files with regexps to create a new version. > >> > > >> > To you and Jon, who also suggested this: how would that be beneficial? > >> > With Beautiful Soup, I have the line number and position within the > >> > line where the tag starts; what does a regex give me that I don't have > >> > that way? > >> > >> You mean you could use BeautifulSoup to read the file and identify the > >> bits you want to change by line number and offset, and then you could > >> use that data to try and update the file, hoping like hell that your > >> definition of "line" and "offset" are identical to BeautifulSoup's > >> and that you don't mess up later changes when you do earlier ones (you > >> could do them in reverse order of line and offset I suppose) and > >> probably resorting to regexps anyway in order to find the part of the > >> tag you want to change ... > >> > >> ... or you could avoid all that faff and just do re.sub()? > > > > Stefan answered in part, but I'll add that it is far FAR easier to do > > the analysis with BS4 than regular expressions. I'm not sure what > > "hoping like hell" is supposed to mean here, since the line and offset > > have been 100% accurate in my experience; > > Given the string: > > b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8?" 
> what is the line number and offset of the question mark - and does
> BeautifulSoup agree with your answer? Does the answer to that second
> question change depending on what parser you tell BeautifulSoup to use?

I'm not sure, because I don't know how to ask BS4 about the location
of a question mark. But I replaced that with a tag, and:

>>> raw = b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8<body>"
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(raw, "html.parser")
>>> soup.body.sourceline
4
>>> soup.body.sourcepos
12
>>> raw.split(b"\n")[3]
b'\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8<body>'
>>> raw.split(b"\n")[3][12:]
b'<body>'

So, yes, it seems to be correct. (Slightly odd in that the sourceline
is 1-based but the sourcepos is 0-based, but that is indeed the case,
as confirmed with a much more straightforward string.)

And yes, it depends on the parser, but I'm using html.parser and it's fine.

> (If your answer is "if the input contains \xed\xa0\x80\xed\xbc\x9f then
> I am happy with the program throwing an exception" then feel free to
> remove that substring from the question.)

Malformed UTF-8 doesn't seem to be a problem. Every file here seems to
be either UTF-8 or ISO-8859, and in the latter case, I'm assuming
8859-1. So I would probably just let this one go through as 8859-1.

> > the only part I'm unsure about is where the _end_ of the tag is (and
> > maybe there's a way I can use BS4 again to get that??).
>
> There doesn't seem to be. More to the point, there doesn't seem to be
> a way to find out where the *attributes* are, so as I said you'll most
> likely end up using regexps anyway.

I'm okay with replacing an entire tag that needs to be changed.
Especially if I can replace just the opening tag, not the contents and
closing tag. And in fact, I may just do that part by scanning for an
unencoded greater-than, on the assumptions that (a) BS4 will correctly
encode any greater-thans in attributes, and (b) if there's a
mis-encoded one in the input, the diff will be small enough to
eyeball, and a human should easily notice that the text has been
massively expanded and duplicated.

ChrisA
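For what it's worth, the standard library's html.parser (the module behind BS4's "html.parser" backend) can report both where a start tag begins and its exact source text, via getpos() and get_starttag_text(). That sidesteps the hunt for the closing ">". The sketch below uses html.parser directly rather than BS4, which is a swapped-in technique and not what was described above; AnchorLocator and rewrite_hrefs are made-up names.

```python
from html.parser import HTMLParser


class AnchorLocator(HTMLParser):
    """Record position and raw source text of <a> tags with a given href."""

    def __init__(self, target_href):
        super().__init__()
        self.target_href = target_href
        self.found = []  # list of (lineno, column, raw_start_tag)

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href") == self.target_href:
            lineno, column = self.getpos()  # 1-based line, 0-based column
            self.found.append((lineno, column, self.get_starttag_text()))


def rewrite_hrefs(text, old, new):
    """Change href=old to href=new in <a> tags, leaving all other bytes alone."""
    parser = AnchorLocator(old)
    parser.feed(text)
    parser.close()

    # html.parser counts only "\n" as a line break, so do exactly the same.
    line_starts = [0]
    pos = text.find("\n")
    while pos != -1:
        line_starts.append(pos + 1)
        pos = text.find("\n", pos + 1)

    # Work backwards so earlier offsets stay valid when the length changes.
    for lineno, column, raw_tag in reversed(parser.found):
        start = line_starts[lineno - 1] + column
        end = start + len(raw_tag)
        assert text[start:end] == raw_tag, "position bookkeeping went wrong"
        # Limitation: replaces the first occurrence of `old` inside the tag,
        # which is assumed to be the href value itself.
        text = text[:start] + raw_tag.replace(old, new, 1) + text[end:]
    return text
```

Attribute order, quoting style, and whitespace inside the tag are kept because only the URL substring within the original tag text is touched. If the URL is entity-encoded in the source, the substring will not be found and the tag is left unchanged.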
Re: subprocess.popen how wait complete open process
On Mon, 22 Aug 2022 at 05:39, simone zambonardi wrote:
> Hi, I am running a program with the punishment subprocess.Popen(...) what I
> should do is to stop the script until the launched program is fully open. How
> can I do this? I used a time.sleep() function but I think there are other
> ways. Thanks

First you have to define "fully open". How would you know?

ChrisA
Re: subprocess.popen how wait complete open process
Sometimes, launching subprocesses can seem like punishment.

I don't think there is a standard cross-platform way to know when a
launched asynchronous process is "fully open" (i.e. fully initialized,
accepting user input).

On Sun, 2022-08-21 at 02:11 -0700, simone zambonardi wrote:
> Hi, I am running a program with the punishment subprocess.Popen(...)
> what I should do is to stop the script until the launched program is
> fully open. How can I do this? I used a time.sleep() function but I
> think there are other ways. Thanks
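When the launched program announces its own readiness, for example by printing a line once it has finished starting up, the script can simply block on that output instead of sleeping. This is a sketch under that assumption only; launch_and_wait and the "READY" marker are invented for illustration, and the child here is a stand-in Python one-liner rather than a real application.

```python
import subprocess
import sys


def launch_and_wait(cmd, ready_marker):
    """Start cmd and block until it prints a line containing ready_marker.

    Returns the running Popen object. Raises RuntimeError if the child's
    output ends before the marker is seen. There is no timeout in this sketch.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        if ready_marker in line:
            return proc
    proc.wait()
    raise RuntimeError("process exited before reporting readiness")


# Stand-in child: pretends to initialize, reports readiness, keeps running briefly.
CHILD = [
    sys.executable,
    "-c",
    "import time; time.sleep(0.2); print('READY', flush=True); time.sleep(0.2)",
]
```

Calling launch_and_wait(CHILD, "READY") returns as soon as the marker line arrives. Programs that never say they are ready (most GUI applications) need some external check instead, and a child that keeps writing to stdout afterwards needs its pipe drained so it does not block.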
Re: Mutating an HTML file with BeautifulSoup
On 2022-08-21, Chris Angelico wrote: > On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list > wrote: >> On 2022-08-20, Chris Angelico wrote: >> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram wrote: >> >> 2qdxy4rzwzuui...@potatochowder.com writes: >> >> >textual representations. That way, the following two elements are the >> >> >same (and similar with a collection of sub-elements in a different order >> >> >in another document): >> >> >> >> The /elements/ differ. They have the /same/ infoset. >> > >> > That's the bit that's hard to prove. >> > >> >> The OP could edit the files with regexps to create a new version. >> > >> > To you and Jon, who also suggested this: how would that be beneficial? >> > With Beautiful Soup, I have the line number and position within the >> > line where the tag starts; what does a regex give me that I don't have >> > that way? >> >> You mean you could use BeautifulSoup to read the file and identify the >> bits you want to change by line number and offset, and then you could >> use that data to try and update the file, hoping like hell that your >> definition of "line" and "offset" are identical to BeautifulSoup's >> and that you don't mess up later changes when you do earlier ones (you >> could do them in reverse order of line and offset I suppose) and >> probably resorting to regexps anyway in order to find the part of the >> tag you want to change ... >> >> ... or you could avoid all that faff and just do re.sub()? > > Stefan answered in part, but I'll add that it is far FAR easier to do > the analysis with BS4 than regular expressions. I'm not sure what > "hoping like hell" is supposed to mean here, since the line and offset > have been 100% accurate in my experience; Given the string: b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8?" what is the line number and offset of the question mark - and does BeautifulSoup agree with your answer? 
Does the answer to that second question change depending on what parser you tell BeautifulSoup to use? (If your answer is "if the input contains \xed\xa0\x80\xed\xbc\x9f then I am happy with the program throwing an exception" then feel free to remove that substring from the question.) > the only part I'm unsure about is where the _end_ of the tag is (and > maybe there's a way I can use BS4 again to get that??). There doesn't seem to be. More to the point, there doesn't seem to be a way to find out where the *attributes* are, so as I said you'll most likely end up using regexps anyway. -- https://mail.python.org/mailman/listinfo/python-list
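The "definition of line" concern can be made concrete with the byte string from this message. Counting only "\n" as a line break gives one answer; counting every break that str.splitlines() recognizes ("\r", "\v" and so on) gives another. The sketch below decodes the bytes as ISO-8859-1, which is an assumption made purely so that every byte maps to one character; locate is a made-up helper.

```python
import re

RAW = b"\n \r\r\n\v\n\r\xed\xa0\x80\xed\xbc\x9f\xcc\x80e\xc3\xa8?"
TEXT = RAW.decode("iso-8859-1")  # assumption: one character per byte


def locate(pieces, index):
    """Map a character index to (1-based line, 0-based column).

    `pieces` must be the text cut into lines with line endings kept.
    """
    start = 0
    for lineno, piece in enumerate(pieces, start=1):
        if index < start + len(piece):
            return lineno, index - start
        start += len(piece)
    raise IndexError(index)


index = TEXT.index("?")

# Definition 1: only "\n" ends a line.
newline_only = locate(re.split(r"(?<=\n)", TEXT), index)   # (4, 12)

# Definition 2: every line boundary str.splitlines() knows about.
universal = locate(TEXT.splitlines(keepends=True), index)  # (7, 11)
```

Any code that turns a (line, offset) pair reported by a parser back into a file position has to use the same definition the parser used, otherwise it lands on the wrong character.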
subprocess.popen how wait complete open process
Hi, I am running a program with the punishment subprocess.Popen(...) what I should do is to stop the script until the launched program is fully open. How can I do this? I used a time.sleep() function but I think there are other ways. Thanks
Re: Mutating an HTML file with BeautifulSoup
On 2022-08-20 21:51:41 -, Jon Ribbens via Python-list wrote:
> On 2022-08-20, Stefan Ram wrote:
> > Jon Ribbens writes:
> > >... or you could avoid all that faff and just do re.sub()?
> > source = ''
> >
> > # Use Python to change the source, keeping the order of attributes.
> >
> > result = re.sub( r'href\s*=\s*"http"', r'href="https"', source )
> > result = re.sub( r"href\s*=\s*'http'", r"href='https'", result )

Depending on the content of the site, this might replace some stuff
which is not a link.

> You could go a bit harder with the regexp of course, e.g.:
>
> result = re.sub(
>     r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""",

This will fail on:

The problem can be solved with regular expressions (and given the
constraints I think I would prefer that to using Beautiful Soup), but
getting the regexps right is not trivial, at least in the general case.
It may become a lot easier if you know that certain conventions were
followed (e.g. that ">" was always written as "&gt;") or it may become
even harder when the files contain errors.

hp

--
Peter J. Holzer | h...@hjp.at | http://www.hjp.at/
"Story must make more sense than reality." -- Charles Stross, "Creative writing challenge!"
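To see the kind of trouble such a pattern can run into, here is the quoted regexp completed into something runnable. The replacement string is an assumption (the quoted call is cut off before it), OLD and NEW are the placeholders from the pattern itself, and the "tricky" input is just one illustrative case chosen here, not necessarily the one this message had in mind.

```python
import re

# Pattern as quoted above; OLD stands for the literal URL being replaced.
PATTERN = re.compile(r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""")


def swap(source):
    """Replace OLD with NEW inside matching href attributes.

    The replacement template is an assumption: keep everything up to the
    opening quote, then put NEW between the same kind of quotes.
    """
    return PATTERN.sub(r"\1\2NEW\2", source)


plain = '<a class="x" href="OLD">link</a>'
tricky = '<a title="a > b" href="OLD">link</a>'  # literal ">" in an earlier attribute

changed = swap(plain)      # href is rewritten, attribute order kept
unchanged = swap(tricky)   # [^>]* stops at the ">" inside title, so no match
```

The second case is silently skipped rather than mangled, which is arguably the nastier failure mode when running over a large number of files: nothing in the output shows that a link was missed.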
Re: Mutating an HTML file with BeautifulSoup
> On 21 Aug 2022, at 09:12, Chris Angelico wrote:
>
> On Sun, 21 Aug 2022 at 17:26, Barry wrote:
> > > On 19 Aug 2022, at 22:04, Chris Angelico wrote:
> > >
> > > On Sat, 20 Aug 2022 at 05:12, Barry wrote:
> > > > > On 19 Aug 2022, at 19:33, Chris Angelico wrote:
> > > > >
> > > > > What's the best way to precisely reconstruct an HTML file after
> > > > > parsing it with BeautifulSoup?
> > > >
> > > > I recall that in bs4 it parses into an object tree and loses the
> > > > detail of the input.
> > > > I recently ported from very old bs to bs4 and hit the same issue.
> > > > So no it will not output the same as went in.
> > > >
> > > > If you can trust the input to be parsed as xml, meaning all the
> > > > rules of closing tags have been followed. Then I think you can
> > > > parse and unparse thru xml to do what you want.
> > >
> > > Yeah, no I can't, this is HTML 4 with a ton of inconsistencies. Oh
> > > well. Thanks for trying, anyhow.
> > >
> > > So I'm left with a few options:
> > >
> > > 1) Give up on validation, give up on verification, and just run this
> > > thing on the production site with my fingers crossed
> >
> > Can you build a beta site with original intact?
>
> In a naive way, a full copy would be quite a few gigabytes. I could
> cut that down a good bit by taking only HTML files and the things they
> reference, but then we run into the same problem of broken links,
> which is what we're here to solve in the first place.
>
> But I would certainly not want to run two copies of the site and then
> manually compare.
>
> > Also wonder if using selenium to walk the site may work as a
> > verification step?
> > I cannot recall if you can get an image of the browser window to do
> > image compares with to look for rendering differences.
>
> Image recognition won't necessarily even be valid; some of the changes
> will have visual consequences (eg a broken image reference now
> becoming correct), and as soon as that happens, the whole document can
> reflow.
>
> > From my one task using bs4 I did not see it produce any bad results.
> > In my case the problems were in the code that built on bs1 using bad
> > assumptions.
>
> Did that get run on perfect HTML, or on messy real-world stuff that
> uses quirks mode?

A small number of messy html pages.

Barry

> ChrisA
Re: Mutating an HTML file with BeautifulSoup
On Sun, 21 Aug 2022 at 17:26, Barry wrote:
> > On 19 Aug 2022, at 22:04, Chris Angelico wrote:
> >
> > On Sat, 20 Aug 2022 at 05:12, Barry wrote:
> > > > On 19 Aug 2022, at 19:33, Chris Angelico wrote:
> > > >
> > > > What's the best way to precisely reconstruct an HTML file after
> > > > parsing it with BeautifulSoup?
> > >
> > > I recall that in bs4 it parses into an object tree and loses the
> > > detail of the input.
> > > I recently ported from very old bs to bs4 and hit the same issue.
> > > So no it will not output the same as went in.
> > >
> > > If you can trust the input to be parsed as xml, meaning all the
> > > rules of closing tags have been followed. Then I think you can
> > > parse and unparse thru xml to do what you want.
> >
> > Yeah, no I can't, this is HTML 4 with a ton of inconsistencies. Oh
> > well. Thanks for trying, anyhow.
> >
> > So I'm left with a few options:
> >
> > 1) Give up on validation, give up on verification, and just run this
> > thing on the production site with my fingers crossed
>
> Can you build a beta site with original intact?

In a naive way, a full copy would be quite a few gigabytes. I could
cut that down a good bit by taking only HTML files and the things they
reference, but then we run into the same problem of broken links,
which is what we're here to solve in the first place.

But I would certainly not want to run two copies of the site and then
manually compare.

> Also wonder if using selenium to walk the site may work as a
> verification step?
> I cannot recall if you can get an image of the browser window to do
> image compares with to look for rendering differences.

Image recognition won't necessarily even be valid; some of the changes
will have visual consequences (eg a broken image reference now
becoming correct), and as soon as that happens, the whole document can
reflow.

> From my one task using bs4 I did not see it produce any bad results.
> In my case the problems were in the code that built on bs1 using bad
> assumptions.

Did that get run on perfect HTML, or on messy real-world stuff that
uses quirks mode?

ChrisA
Re: Mutating an HTML file with BeautifulSoup
> On 19 Aug 2022, at 22:04, Chris Angelico wrote:
>
> On Sat, 20 Aug 2022 at 05:12, Barry wrote:
> > > On 19 Aug 2022, at 19:33, Chris Angelico wrote:
> > >
> > > What's the best way to precisely reconstruct an HTML file after
> > > parsing it with BeautifulSoup?
> >
> > I recall that in bs4 it parses into an object tree and loses the
> > detail of the input.
> > I recently ported from very old bs to bs4 and hit the same issue.
> > So no it will not output the same as went in.
> >
> > If you can trust the input to be parsed as xml, meaning all the
> > rules of closing tags have been followed. Then I think you can
> > parse and unparse thru xml to do what you want.
>
> Yeah, no I can't, this is HTML 4 with a ton of inconsistencies. Oh
> well. Thanks for trying, anyhow.
>
> So I'm left with a few options:
>
> 1) Give up on validation, give up on verification, and just run this
> thing on the production site with my fingers crossed

Can you build a beta site with original intact?

Also wonder if using selenium to walk the site may work as a verification step?
I cannot recall if you can get an image of the browser window to do image
compares with to look for rendering differences.

From my one task using bs4 I did not see it produce any bad results.
In my case the problems were in the code that built on bs1 using bad
assumptions.

> 2) Instead of doing an intelligent reconstruction, just str.replace()
> one URL with another within the file
> 3) Split the file into lines, find the Nth line (elem.sourceline) and
> str.replace that line only
> 4) Attempt to use elem.sourceline and elem.sourcepos to find the start
> of the tag, manually find the end, and replace one tag with the
> reconstructed form.
>
> I'm inclined to the first option, honestly. The others just seem like
> hard work, and I became a programmer so I could be lazy...
>
> ChrisA
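Option 3 from the quoted list is small enough to sketch. Given a 1-based line number such as the one elem.sourceline reports, replace the URL on that line only and leave the rest of the file byte-for-byte alone. replace_on_line is a made-up name, and the sketch assumes lines are delimited by a plain newline character, which matches how the html.parser backend counts them.

```python
def replace_on_line(text, lineno, old, new):
    """Replace old with new on one line only (1-based, newline-delimited).

    Raises ValueError if old does not occur on that line, so that a wrong
    line number is noticed instead of silently doing nothing.
    """
    lines = text.split("\n")  # not splitlines(): keep "\r" and friends intact
    target = lines[lineno - 1]
    if old not in target:
        raise ValueError(f"{old!r} not found on line {lineno}")
    lines[lineno - 1] = target.replace(old, new)
    return "\n".join(lines)
```

For example, replace_on_line(page, 3, "http://example.com/", "https://example.com/") would touch only the third line of page. Every occurrence on that line is replaced, so a line carrying both a link and the same URL as visible text would have both changed; that is the trade-off against option 4.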