Re: Mutating an HTML file with BeautifulSoup
I've had much success doing round trips through the lxml.html parser:

https://lxml.de/lxmlhtml.html

I ditched bs for lxml long ago and never regretted it.

If you find that you have a bunch of invalid HTML that lxml inadvertently
"fixes", I would recommend adding a stutter-step to your project: perform a
no-op round trip through lxml on all files. I'd then analyze any diff by
progressively excluding changes via `grep -vP`. Unless I'm mistaken, all such
changes should fall into no more than a dozen groups.

On Fri, Aug 19, 2022, 1:34 PM Chris Angelico wrote:

> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
>
> Using the Alice example from the BS4 docs:
>
> >>> html_doc = """<html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
> <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
> <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
>
> <p class="story">...</p>
> """
> >>> print(soup)
> <html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
> <p class="story">...</p>
> </body></html>
> >>>
>
> Note two distinct changes: firstly, whitespace has been removed, and
> secondly, attributes are reordered (I think alphabetically). There are
> other canonicalizations being done, too.
>
> I'm trying to make some automated changes to a huge number of HTML
> files, with minimal diffs so they're easy to validate. That means that
> spurious changes like these are very much unwanted. Is there a way to
> get BS4 to reconstruct the original precisely?
>
> The mutation itself would be things like finding an anchor tag and
> changing its href attribute. Fairly simple changes, but might alter
> the length of the file (eg changing "http://example.com/" into
> "https://example.com/"). I'd like to do them intelligently rather than
> falling back on element.sourceline and element.sourcepos, but worst
> case, that's what I'll have to do (which would be fiddly).
>
> ChrisA

-- 
https://mail.python.org/mailman/listinfo/python-list
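For the fallback Chris mentions (working from source positions instead of re-serialising the tree), here is a minimal sketch that uses only the standard library's html.parser rather than BeautifulSoup or lxml. It records where the "http://" of each anchor's href sits in the original text and splices in "https://", leaving everything else untouched. The names AnchorSchemeFinder and upgrade_anchor_schemes are made up for illustration, and the regex only covers the plain href=... spelling, so treat it as a starting point rather than a finished tool.

```python
import re
from html.parser import HTMLParser

# Finds the href attribute inside one raw start tag and captures its scheme.
# Assumption: a plain (optionally quoted) href=... with no entity tricks.
HREF_SCHEME = re.compile(r'(?<![\w-])href\s*=\s*["\']?(http://)', re.IGNORECASE)


class AnchorSchemeFinder(HTMLParser):
    """Record absolute (start, end) spans of 'http://' inside <a href=...>."""

    def __init__(self, source):
        super().__init__()
        # HTMLParser counts lines by "\n" only, so build the table the same way.
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(source) if ch == "\n"]
        self.spans = []
        self.feed(source)
        self.close()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        lineno, col = self.getpos()  # position of this tag's "<"
        tag_start = self.line_starts[lineno - 1] + col
        match = HREF_SCHEME.search(self.get_starttag_text())
        if match:
            self.spans.append((tag_start + match.start(1), tag_start + match.end(1)))


def upgrade_anchor_schemes(source):
    """Return source with http:// changed to https:// in anchor hrefs only."""
    out = source
    # Splice right-to-left so the earlier offsets stay valid.
    for start, end in reversed(AnchorSchemeFinder(source).spans):
        out = out[:start] + "https://" + out[end:]
    return out
```

Because the text is spliced rather than re-serialised, attribute order and whitespace survive, so a diff of the result should show only the changed URLs.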
Re: New Python implementation
On Thu, Feb 11, 2021 at 1:49 PM dn via Python-list wrote:

> When I first met it, one of the concepts I found difficult to 'wrap my
> head around' was the idea that "open software" allowed folk to fork the
> original work and 'do their own thing'. My thinking was (probably)
> "surely, the original is the authoritative version". Having other
> versions seemed an invitation to confusion and dilution.
>
> However, as soon as (open) software is made available, other people
> start making it 'better' - whatever their own definition of "better".
>
> Yes, it is both a joy and a complication.
>
> ...
>
> Wishing you well. It seems (to (neos-ignorant) me at least) an ambitious
> project. There are certainly times when 'execution speed' becomes a
> major criterion. Many of us will look forward to (your development of) a
> solution. Please let us know when it's ready for use/trials...

Well put! Thank you for this thoughtful and informative message. You
obviously put substantial work into it.
Re: Explicit vararg values
Received?

On Sun, Sep 16, 2018 at 3:39 PM Buck Evan wrote:

> I started to send this to python-ideas, but I'm having second thoughts.
> Does this have merit?
> [...]
Explicit vararg values
I started to send this to python-ideas, but I'm having second thoughts.
Does this have merit?

---

I stumble on this a lot, and I see it in many Python libraries:

    def f(*args, **kwargs):
        ...

    f(*[list comprehension])
    f(**mydict)

It always seems a shame to carefully build up an object in order to
explode it, just to pack it into a near-identical object.

Today I was fiddling with the new python3.7 inspect.signature
functionality when I ran into this case:

    def f(**kwargs): pass
    sig = inspect.signature(f)
    print(sig.bind(a=1, b=2))

The output is "<BoundArguments (kwargs={'a': 1, 'b': 2})>". I found this a
bit humorous since anyone attempting to bind values in this way, using
f(kwargs={'a': 1, 'b': 2}), will be sorely disappointed. I also wondered
why BoundArguments didn't print '**kwargs', since that's the __str__ of that
parameter object.

The syntax I'm proposing is:

    f(**kwargs={'a': 1, 'b': 2})

as a synonym of f(a=1, b=2) when an appropriate dictionary is already on
hand.

---

I can argue for this another way as well.

1) When both caller and callee have a known number of values to
   pass/receive, that's the usual syntax: def f(x) and f(1)

2) When the caller has a fixed set of values, but the callee wants to
   handle a variable number: def f(*args) and f(1)

3) When the caller has a variable number of arguments (varargs) but the
   callee is fixed, that's the splat operator: def f(x) and f(*args)

4) When cases 2 and 3 cross paths, and we have a vararg in both the caller
   and callee, right now we're forced to splat both sides: def f(*args) and
   f(*args). I'd like the option of opting in to passing along my list
   as-is, with no splat or collection operations involved: def f(*args) and
   f(*args=args)

Currently the pattern to handle case 4 neatly is to define two versions of
a vararg function:

    def f(*args, **kwargs):
        return _f(args, kwargs)

    def _f(args, kwargs):
        ...

That way, when internal callers hit case 4, there's a simple and
efficient way forward: use the internal de-vararg'd definition of f.
External callers have no such option, though, without breaking protected-API
convention.

My proposal would simplify this implementation, as well as allowing users
to make use of a similar calling convention that was only provided
privately before.

Examples:

    log(*args) and _log(args) in logging.Logger
    format and vformat of string.Formatter
[issue34706] Signature.from_callable sometimes drops subclassing
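Change by Buck Evan:

----------
type:  -> behavior

_______________________________________
Python tracker <https://bugs.python.org/issue34706>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com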
[issue34706] Signature.from_callable sometimes drops subclassing
New submission from Buck Evan:

Specifically in the case of a class that does not override its constructor
signature inherited from object.

GitHub PR incoming shortly.

----------
components: Library (Lib)
messages: 325501
nosy: bukzor
priority: normal
severity: normal
status: open
title: Signature.from_callable sometimes drops subclassing
versions: Python 3.7
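A minimal check for the behaviour described might look like the sketch below. MySignature, Plain and takes_one are hypothetical names. Going by the issue title, an affected interpreter would be expected to report the base Signature class for Plain rather than the subclass, but that is inferred from the report, not verified here.

```python
import inspect


class MySignature(inspect.Signature):
    """Hypothetical Signature subclass used only to observe the return type."""


class Plain:
    """A class that keeps the constructor signature inherited from object."""


def takes_one(x):
    return x


# from_callable is a classmethod, so the result would normally be
# an instance of the class it was called on.
class_sig = MySignature.from_callable(Plain)
func_sig = MySignature.from_callable(takes_one)

print(type(class_sig).__name__)  # the case the report is about
print(type(func_sig).__name__)   # plain functions keep the subclass
```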
[issue24085] large memory overhead when pyc is recompiled
Buck Evan added the comment:

@serhiy.storchaka This is a very stable piece of a legacy code base, so we're
not keen to refactor it so dramatically, although we could.

We've worked around this issue by compiling pyc files ahead of time and taking
extra care that they're preserved through deployment. This isn't blocking our
2.7 transition anymore.

----------

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue24085
___
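The ahead-of-time compilation workaround can be sketched with the standard library's compileall module. This is a guess at the general shape, not the deployment script actually used; precompile and cached_pycs are names invented for illustration, and the code targets Python 3, where the pyc files land in __pycache__ directories.

```python
import compileall
from pathlib import Path


def precompile(tree):
    """Byte-compile every .py file under `tree`; True if all succeeded."""
    # quiet=1 prints only errors instead of listing each compiled file.
    return bool(compileall.compile_dir(str(tree), quiet=1))


def cached_pycs(tree):
    """List the .pyc files currently present under `tree`."""
    return sorted(Path(tree).rglob("*.pyc"))
```

Running `python -m compileall <dir>` from a build step does the same job. By default CPython validates a pyc against the source file's recorded modification time, so a deployment step that rewrites mtimes can trigger the recompilation this is meant to avoid.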
[issue24085] large memory overhead when pyc is recompiled
Buck Evan added the comment:

New data: The memory consumption seems to be in the compiler rather than the
marshaller:

```
$ PYTHONDONTWRITEBYTECODE=1 python -c 'import repro'
16032
$ PYTHONDONTWRITEBYTECODE=1 python -c 'import repro'
16032
$ PYTHONDONTWRITEBYTECODE=1 python -c 'import repro'
16032
$ python -c 'import repro'
16032
$ PYTHONDONTWRITEBYTECODE=1 python -c 'import repro'
8984
$ PYTHONDONTWRITEBYTECODE=1 python -c 'import repro'
8984
$ PYTHONDONTWRITEBYTECODE=1 python -c 'import repro'
8984
```

We were trying to use PYTHONDONTWRITEBYTECODE as a workaround to this issue,
but it didn't help us because of this.
[issue24085] large memory overhead when pyc is recompiled
New submission from Buck Evan:

In the attached example I show that there's a significant memory overhead
present whenever a pre-compiled pyc is not present.

This only occurs with more than 5225 objects (dictionaries in this case)
allocated. At 13756 objects, the mysterious pyc overhead is 50% of memory
usage.

I've reproduced this issue in Python 2.6, 2.7, and 3.4. I imagine it's
present in all CPython versions.

    $ python -c 'import repro'
    16736
    $ python -c 'import repro'
    8964
    $ python -c 'import repro'
    8964
    $ rm *.pyc; python -c 'import repro'
    16740
    $ rm *.pyc; python -c 'import repro'
    16736
    $ rm *.pyc; python -c 'import repro'
    16740

----------
files: repro.py
messages: 242281
nosy: bukzor
priority: normal
severity: normal
status: open
title: large memory overhead when pyc is recompiled
versions: Python 3.4
Added file: http://bugs.python.org/file39238/repro.py
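The attached repro.py is not reproduced in this thread. The sketch below is only a guess at its shape, assuming the dictionaries are literals in the module source and that the printed figure is the peak resident set size. The names write_repro, peak_rss_on_import and cold_and_warm are invented, the resource module is Unix-only, and no particular numbers should be expected from it.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def write_repro(directory, count=13756):
    """Write a hypothetical repro.py holding `count` dict literals."""
    body = ",\n".join("    {'id': %d, 'name': 'item%d'}" % (i, i) for i in range(count))
    source = (
        "import resource\n"
        "DATA = [\n" + body + "\n]\n"
        "# Peak resident set size: kilobytes on Linux, bytes on macOS.\n"
        "print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)\n"
    )
    path = Path(directory) / "repro.py"
    path.write_text(source)
    return path


def peak_rss_on_import(directory):
    """Import repro in a fresh interpreter and return the figure it prints."""
    result = subprocess.run(
        [sys.executable, "-c", "import repro"],
        cwd=str(directory), capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


def cold_and_warm(count=13756):
    """Return (first import, second import) figures for a fresh directory."""
    with tempfile.TemporaryDirectory() as directory:
        write_repro(directory, count)
        return peak_rss_on_import(directory), peak_rss_on_import(directory)
```

The first import has to compile the module and the second can load the cached pyc, which mirrors the cold and warm runs in the shell session above.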
[issue24085] large memory overhead when pyc is recompiled
Buck Evan added the comment:

Also, we've reproduced this on both Linux and OS X.