Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Buck Evan
I've had much success doing round trips through the lxml.html parser.

https://lxml.de/lxmlhtml.html

I ditched bs for lxml long ago and never regretted it.

If you find that you have a bunch of invalid HTML that lxml inadvertently
"fixes", I would recommend adding a stutter-step to your project: perform a
no-op round trip through lxml on all files, and commit that result on its
own. I'd then analyze any diff by progressively excluding known changes via
`grep -vP`. Unless I'm mistaken, all such changes should fall into no more
than a dozen groups.
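
A minimal sketch of that no-op round trip, assuming lxml is installed. The
helper names (roundtrip_lxml, roundtrip_diff) are made up for illustration,
and lxml.html.fromstring / lxml.html.tostring are only one of several ways to
parse and serialize; depending on the input, details such as the doctype may
show up as one of the diff groups.

```python
import difflib


def roundtrip_lxml(text):
    """Parse with lxml.html and serialize straight back, changing nothing."""
    import lxml.html  # third-party (pip install lxml), imported lazily

    tree = lxml.html.fromstring(text)
    return lxml.html.tostring(tree, encoding="unicode")


def roundtrip_diff(original, roundtrip=roundtrip_lxml):
    """Return the unified diff between a document and its no-op round trip."""
    after = roundtrip(original)
    return list(
        difflib.unified_diff(
            original.splitlines(),
            after.splitlines(),
            fromfile="original",
            tofile="roundtrip",
            lineterm="",
        )
    )
```

An empty list means the file survives the round trip untouched; anything else
is what you would feed to `grep -vP` to sort into groups.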




On Fri, Aug 19, 2022, 1:34 PM Chris Angelico  wrote:

> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
>
> Using the Alice example from the BS4 docs:
>
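> >>> html_doc = """<html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
> <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
> <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
>
> <p class="story">...</p>
> """
> >>> print(soup)
> <html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
> <p class="story">...</p>
> </body></html>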
> >>>
>
> Note two distinct changes: firstly, whitespace has been removed, and
> secondly, attributes are reordered (I think alphabetically). There are
> other canonicalizations being done, too.
>
> I'm trying to make some automated changes to a huge number of HTML
> files, with minimal diffs so they're easy to validate. That means that
> spurious changes like these are very much unwanted. Is there a way to
> get BS4 to reconstruct the original precisely?
>
> The mutation itself would be things like finding an anchor tag and
> changing its href attribute. Fairly simple changes, but might alter
> the length of the file (eg changing "http://example.com/" into
> "https://example.com/"). I'd like to do them intelligently rather than
> falling back on element.sourceline and element.sourcepos, but worst
> case, that's what I'll have to do (which would be fiddly).
>
> ChrisA
> --
> https://mail.python.org/mailman/listinfo/python-list
>


Re: New Python implementation

2021-02-15 Thread Buck Evan
On Thu, Feb 11, 2021 at 1:49 PM dn via Python-list 
wrote:

> When I first met it, one of the concepts I found difficult to 'wrap my
> head around' was the idea that "open software" allowed folk to fork the
> original work and 'do their own thing'. My thinking was (probably)
> "surely, the original is the authoritative version". Having other
> versions seemed an invitation to confusion and dilution.
>
> However, as soon as (open) software is made available, other people
> start making it 'better' - whatever their own definition of "better".
>
> Yes, it is both a joy and a complication.
>
> ...
>
> Wishing you well. It seems (to (neos-ignorant) me at least) an ambitious
> project. There are certainly times when 'execution speed' becomes a
> major criteria. Many of us will look forward to (your development of) a
> solution. Please let us know when it's ready for use/trials...
>

Well put! Thank you for this thoughtful and informative message. You
obviously put substantial work into it.


Re: Explicit vararg values

2018-09-22 Thread Buck Evan
Received?

On Sun, Sep 16, 2018 at 3:39 PM Buck Evan  wrote:

> I started to send this to python-ideas, but I'm having second thoughts.
> Does this have merit?
>
> ---
> I stumble on this a lot, and I see it in many python libraries:
>
> def f(*args, **kwargs):
>     ...
>
> f(*[list comprehension])
> f(**mydict)
>
> It always seems a shame to carefully build up an object in order to
> explode it, just to pack it into a near-identical object.
>
> Today I was fiddling with the new python3.7 inspect.signature
> functionality when I ran into this case:
>
> def f(**kwargs): pass
> sig = inspect.signature(f)
> print(sig.bind(a=1, b=2))
>
> The output is "<BoundArguments (kwargs={'a': 1, 'b': 2})>". I found this a
> bit humorous since anyone attempting to bind values in this way, using
> f(kwargs={'a': 1, 'b': 2}) will be sorely disappointed. I also wondered
> why BoundArguments didn't print '**kwargs' since that's the __str__ of that
> parameter object.
>
> The syntax I'm proposing is:
>    f(**kwargs={'a': 1, 'b': 2})
>
> as a synonym of f(a=1, b=2) when an appropriate dictionary is already on
> hand.
>
> ---
> I can argue for this another way as well.
>
> 1)
> When both caller and callee have a known number of values to pass/receive,
> that's the usual syntax:
> def f(x) and f(1)
>
> 2)
> When the caller has a fixed set of values, but the callee wants to handle
> a variable number:   def f(*args) and f(1)
>
> 3)
> Caller has a variable number of arguments (varargs) but the call-ee is
> fixed, that's the splat operator: def f(x) and f(*args)
>
> 4)
> When cases 2 and 3 cross paths, and we have a vararg in both the caller and
> callee, right now we're forced to splat both sides: def f(*args) and
> f(*args), but I'd like the option of opting-in to passing along my list
> as-is with no splat or collection operations involved: def f(*args) and
> f(*args=args)
>
> Currently the pattern to handle case 4 neatly is to define two versions of
> a vararg function:
>
> def f(*args, **kwargs):
>     return _f(args, kwargs)
>
> def _f(args, kwargs):
>     ...
>
> Such that when internal callers hit case 4, there's a simple and
> efficient way forward -- use the internal de-vararg'd  definition of f.
> External callers have no such option though, without breaking protected api
> convention.
>
> My proposal would simplify this implementation as well as allowing users
> to make use of a similar calling convention that was only provided
> privately before.
>
> Examples:
>
> log(*args) and _log(args) in logging.Logger
> format and vformat of string.Formatter
>


Explicit vararg values

2018-09-17 Thread Buck Evan
I started to send this to python-ideas, but I'm having second thoughts.
Does this have merit?

---
I stumble on this a lot, and I see it in many python libraries:

def f(*args, **kwargs):
    ...

f(*[list comprehension])
f(**mydict)

It always seems a shame to carefully build up an object in order to explode
it, just to pack it into a near-identical object.

Today I was fiddling with the new python3.7 inspect.signature functionality
when I ran into this case:

def f(**kwargs): pass
sig = inspect.signature(f)
print(sig.bind(a=1, b=2))

The output is "<BoundArguments (kwargs={'a': 1, 'b': 2})>". I found this a
bit humorous since anyone attempting to bind values in this way, using
f(kwargs={'a': 1, 'b': 2}) will be sorely disappointed. I also wondered
why BoundArguments didn't print '**kwargs' since that's the __str__ of that
parameter object.
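
A short sketch of that mismatch using only the standard inspect module. The
function f is the one from the snippet above, except that it returns its
kwargs so the effect is visible; the printed values in the comments are what
I'd expect on Python 3.7 or later.

```python
import inspect


def f(**kwargs):
    return kwargs


sig = inspect.signature(f)
bound = sig.bind(a=1, b=2)

# The keyword arguments are gathered under the parameter's own name.
print(bound.arguments["kwargs"])  # {'a': 1, 'b': 2}

# Passing kwargs= literally just creates a single keyword named "kwargs".
print(f(kwargs={"a": 1, "b": 2}))  # {'kwargs': {'a': 1, 'b': 2}}

# Replaying the binding has to go through bound.args and bound.kwargs.
print(f(*bound.args, **bound.kwargs))  # {'a': 1, 'b': 2}

# The parameter itself does print with its stars.
print(str(sig.parameters["kwargs"]))  # **kwargs
```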

The syntax I'm proposing is:
   f(**kwargs={'a': 1, 'b': 2})

as a synonym of f(a=1, b=2) when an appropriate dictionary is already on
hand.

---
I can argue for this another way as well.

1)
When both caller and callee have a known number of values to pass/receive,
that's the usual syntax:
def f(x) and f(1)

2)
When the caller has a fixed set of values, but the callee wants to handle a
variable number:   def f(*args) and f(1)

3)
Caller has a variable number of arguments (varargs) but the call-ee is
fixed, that's the splat operator: def f(x) and f(*args)

4)
When cases 2 and 3 cross paths, and we have a vararg in both the caller and
callee, right now we're forced to splat both sides: def f(*args) and
f(*args), but I'd like the option of opting-in to passing along my list
as-is with no splat or collection operations involved: def f(*args) and
f(*args=args)

Currently the pattern to handle case 4 neatly is to define two versions of
a vararg function:

def f(*args, **kwargs):
    return _f(args, kwargs)

def _f(args, kwargs):
    ...

Such that when internal callers hit case 4, there's a simple and efficient
way forward -- use the internal de-vararg'd  definition of f. External
callers have no such option though, without breaking protected api
convention.

My proposal would simplify this implementation as well as allowing users to
make use of a similar calling convention that was only provided privately
before.

Examples:

log(*args) and _log(args) in logging.Logger
format and vformat of string.Formatter


[issue34706] Signature.from_callable sometimes drops subclassing

2018-09-16 Thread Buck Evan


Change by Buck Evan :


--
type:  -> behavior

___
Python tracker 
<https://bugs.python.org/issue34706>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34706] Signature.from_callable sometimes drops subclassing

2018-09-16 Thread Buck Evan


New submission from Buck Evan :

Specifically in the case of a class that does not override its constructor 
signature inherited from object.

Github PR incoming shortly.
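
A small sketch of the kind of case described, as I understand it. The class
names MySignature and Plain are hypothetical, and which type gets printed
depends on whether the interpreter in use includes the fix.

```python
import inspect


class MySignature(inspect.Signature):
    """A Signature subclass whose type should survive from_callable."""


class Plain:
    """Does not override __init__ or __new__, so it inherits object's."""


sig = MySignature.from_callable(Plain)

# Expected: MySignature. On an affected interpreter this may report the
# plain Signature base class instead.
print(type(sig).__name__)
```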

--
components: Library (Lib)
messages: 325501
nosy: bukzor
priority: normal
severity: normal
status: open
title: Signature.from_callable sometimes drops subclassing
versions: Python 3.7




[issue24085] large memory overhead when pyc is recompiled

2015-05-04 Thread Buck Evan

Buck Evan added the comment:

@serhiy.storchaka This is a very stable piece of a legacy code base, so we're 
not keen to refactor it so dramatically, although we could. 

We've worked around this issue by compiling pyc files ahead of time and taking 
extra care that they're preserved through deployment. This isn't blocking our 
2.7 transition anymore.
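
A rough sketch of the ahead-of-time compile step, using the standard
compileall module. The precompile helper and the stub module are hypothetical
and not our actual deployment code. Note that Python 3 writes into
__pycache__ while 2.7 writes the .pyc next to the source, and that pyc
validation is timestamp-based by default, which is why the files need care
through deployment.

```python
import compileall
import pathlib
import tempfile


def precompile(tree):
    """Byte-compile every .py file under `tree`; True if all succeeded."""
    return bool(compileall.compile_dir(str(tree), quiet=1, force=True))


if __name__ == "__main__":
    # Hypothetical stand-in for a deployment tree.
    with tempfile.TemporaryDirectory() as tmp:
        module = pathlib.Path(tmp) / "repro_stub.py"
        module.write_text("DATA = [{'n': i} for i in range(10)]\n")
        print(precompile(tmp))
        print(sorted(p.name for p in pathlib.Path(tmp).rglob("*.pyc")))
```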

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue24085
___



[issue24085] large memory overhead when pyc is recompiled

2015-05-01 Thread Buck Evan

Buck Evan added the comment:

New data: The memory consumption seems to be in the compiler rather than the 
marshaller:


```
$ PYTHONDONTWRITEBYTECODE=1 python -c 'import repro'
16032
$ PYTHONDONTWRITEBYTECODE=1 python -c 'import repro'
16032
$ PYTHONDONTWRITEBYTECODE=1 python -c 'import repro'
16032

$ python -c 'import repro'
16032

$ PYTHONDONTWRITEBYTECODE=1 python -c 'import repro'
8984
$ PYTHONDONTWRITEBYTECODE=1 python -c 'import repro'
8984
$ PYTHONDONTWRITEBYTECODE=1 python -c 'import repro'
8984
```

We were trying to use PYTHONDONTWRITEBYTECODE as a workaround for this issue, 
but it didn't help us: the overhead comes from compiling the module, not from 
writing out the bytecode.

--




[issue24085] large memory overhead when pyc is recompiled

2015-04-30 Thread Buck Evan

New submission from Buck Evan:

In the attached example I show that there's a significant memory overhead 
present whenever a pre-compiled pyc is not present.

This only occurs with more than 5225 objects (dictionaries in this case)
allocated. At 13756 objects, the mysterious pyc overhead is 50% of memory
usage.

I've reproduced this issue in Python 2.6, 2.7, and 3.4. I imagine it's present 
in all CPython versions.


$ python -c 'import repro'
16736
$ python -c 'import repro'
8964
$ python -c 'import repro'
8964

$ rm *.pyc; python -c 'import repro'
16740
$ rm *.pyc; python -c 'import repro'
16736
$ rm *.pyc; python -c 'import repro'
16740

--
files: repro.py
messages: 242281
nosy: bukzor
priority: normal
severity: normal
status: open
title: large memory overhead when pyc is recompiled
versions: Python 3.4
Added file: http://bugs.python.org/file39238/repro.py




[issue24085] large memory overhead when pyc is recompiled

2015-04-30 Thread Buck Evan

Buck Evan added the comment:

Also, we've reproduced this on both Linux and OS X.

--
