On 2022-08-22 19:27:28 -, Jon Ribbens via Python-list wrote:
> On 2022-08-22, Peter J. Holzer wrote:
> > On 2022-08-22 00:45:56 -, Jon Ribbens via Python-list wrote:
> >> With the offset though, BeautifulSoup made an arbitrary decision to
> >> use ISO-8859-1 encoding and so when you
On 2022-08-22, Peter J. Holzer wrote:
> On 2022-08-22 00:45:56 -, Jon Ribbens via Python-list wrote:
>> With the offset though, BeautifulSoup made an arbitrary decision to
>> use ISO-8859-1 encoding and so when you chopped the bytestring at
>> that offset it only worked because BeautifulSoup
On 2022-08-22 00:45:56 -, Jon Ribbens via Python-list wrote:
> With the offset though, BeautifulSoup made an arbitrary decision to
> use ISO-8859-1 encoding and so when you chopped the bytestring at
> that offset it only worked because BeautifulSoup had happened to
> choose a
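A minimal sketch of the offset issue being discussed here, assuming a UTF-8 document with one non-ASCII character ahead of the tag of interest. The sample markup and the variable names are invented for illustration and do not come from the thread. Decoding as ISO-8859-1 maps every byte to exactly one character, so character offsets coincide with byte offsets. Decoding as UTF-8 does not, so a character offset from a UTF-8 parse cannot be used to slice the raw bytes.

```python
# Hypothetical sample document: the "é" occupies two bytes in UTF-8.
data = '<p>café <a href="http://example.com/">x</a></p>'.encode("utf-8")

# The same bytes decoded two ways.
text_latin1 = data.decode("iso-8859-1")  # one character per byte
text_utf8 = data.decode("utf-8")         # "é" becomes a single character

byte_off = data.index(b"<a ")
char_off_latin1 = text_latin1.index("<a ")
char_off_utf8 = text_utf8.index("<a ")

# Slicing the bytestring is only safe with an offset that counts bytes.
good_slice = data[char_off_latin1:]
bad_slice = data[char_off_utf8:]

print(byte_off, char_off_latin1, char_off_utf8)
print(good_slice[:3], bad_slice[:3])
```

With a single-byte decoding the two offsets agree. With UTF-8 the character offset falls one byte short for each two-byte character that precedes it.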
On 2022-08-22 00:09:01 -, Jon Ribbens via Python-list wrote:
> On 2022-08-21, Peter J. Holzer wrote:
> > On 2022-08-20 21:51:41 -, Jon Ribbens via Python-list wrote:
> >> result = re.sub(
> >> r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""",
> >
> > This will fail on:
> >
On 2022-08-21, Chris Angelico wrote:
> On Mon, 22 Aug 2022 at 05:43, Jon Ribbens via Python-list
> wrote:
>> On 2022-08-21, Chris Angelico wrote:
>> > On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list
>> > wrote:
>> >> On 2022-08-20, Chris Angelico wrote:
>> >> > On Sun, 21 Aug 2022 at
On 2022-08-21, Peter J. Holzer wrote:
> On 2022-08-20 21:51:41 -, Jon Ribbens via Python-list wrote:
>> On 2022-08-20, Stefan Ram wrote:
>> > Jon Ribbens writes:
>> >>... or you could avoid all that faff and just do re.sub()?
>
>> > source = ''
>> >
>> > # Use Python to change the source,
On 22/08/2022 05:30, Chris Angelico wrote:
> On Mon, 22 Aug 2022 at 10:04, Buck Evan wrote:
>> I've had much success doing round trips through the lxml.html parser.
>> https://lxml.de/lxmlhtml.html
>> I ditched bs for lxml long ago and never regretted it.
>> If you find that you have a bunch of invalid
On Mon, 22 Aug 2022 at 10:04, Buck Evan wrote:
>
> I've had much success doing round trips through the lxml.html parser.
>
> https://lxml.de/lxmlhtml.html
>
> I ditched bs for lxml long ago and never regretted it.
>
> If you find that you have a bunch of invalid html that lxml inadvertently
>
I've had much success doing round trips through the lxml.html parser.
https://lxml.de/lxmlhtml.html
I ditched bs for lxml long ago and never regretted it.
If you find that you have a bunch of invalid html that lxml inadvertently
"fixes", I would recommend adding a stutter-step to your project:
On Mon, 22 Aug 2022 at 05:43, Jon Ribbens via Python-list
wrote:
>
> On 2022-08-21, Chris Angelico wrote:
> > On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list
> > wrote:
> >> On 2022-08-20, Chris Angelico wrote:
> >> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram wrote:
> >> >>
On 2022-08-21, Chris Angelico wrote:
> On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list
> wrote:
>> On 2022-08-20, Chris Angelico wrote:
>> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram wrote:
>> >> 2qdxy4rzwzuui...@potatochowder.com writes:
>> >> >textual representations. That way, the
On 2022-08-20 21:51:41 -, Jon Ribbens via Python-list wrote:
> On 2022-08-20, Stefan Ram wrote:
> > Jon Ribbens writes:
> >>... or you could avoid all that faff and just do re.sub()?
> > source = ''
> >
> > # Use Python to change the source, keeping the order of attributes.
> >
> > result =
> On 21 Aug 2022, at 09:12, Chris Angelico wrote:
>
> On Sun, 21 Aug 2022 at 17:26, Barry wrote:
>>
>>
>>
On 19 Aug 2022, at 22:04, Chris Angelico wrote:
>>>
>>> On Sat, 20 Aug 2022 at 05:12, Barry wrote:
>> On 19 Aug 2022, at 19:33, Chris Angelico wrote:
On Sun, 21 Aug 2022 at 17:26, Barry wrote:
>
>
>
> > On 19 Aug 2022, at 22:04, Chris Angelico wrote:
> >
> > On Sat, 20 Aug 2022 at 05:12, Barry wrote:
> >>
> >>
> >>
> On 19 Aug 2022, at 19:33, Chris Angelico wrote:
> >>>
> >>> What's the best way to precisely reconstruct an HTML file
> On 19 Aug 2022, at 22:04, Chris Angelico wrote:
>
> On Sat, 20 Aug 2022 at 05:12, Barry wrote:
>>
>>
>>
On 19 Aug 2022, at 19:33, Chris Angelico wrote:
>>>
>>> What's the best way to precisely reconstruct an HTML file after
>>> parsing it with BeautifulSoup?
>>
>> I recall that
On Sun, 21 Aug 2022 at 13:41, dn wrote:
>
> On 21/08/2022 13.00, Chris Angelico wrote:
> > Well, I don't like headaches, but I do appreciate what the G&S Archive
> > has given me over the years, so I'm taking this on as a means of
> > giving back to the community.
>
> This point will be picked-up
On 21/08/2022 13.00, Chris Angelico wrote:
> On Sun, 21 Aug 2022 at 09:48, dn wrote:
>> On 20/08/2022 12.38, Chris Angelico wrote:
>>> On Sat, 20 Aug 2022 at 10:19, dn wrote:
On 20/08/2022 09.01, Chris Angelico wrote:
> On Sat, 20 Aug 2022 at 05:12, Barry wrote:
>>> On 19 Aug 2022,
On Sun, 21 Aug 2022 at 09:48, dn wrote:
>
> On 20/08/2022 12.38, Chris Angelico wrote:
> > On Sat, 20 Aug 2022 at 10:19, dn wrote:
> >> On 20/08/2022 09.01, Chris Angelico wrote:
> >>> On Sat, 20 Aug 2022 at 05:12, Barry wrote:
> > On 19 Aug 2022, at 19:33, Chris Angelico wrote:
> >
>
On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list
wrote:
>
> On 2022-08-20, Chris Angelico wrote:
> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram wrote:
> >> 2qdxy4rzwzuui...@potatochowder.com writes:
> >> >textual representations. That way, the following two elements are the
> >> >same
On 20/08/2022 12.38, Chris Angelico wrote:
> On Sat, 20 Aug 2022 at 10:19, dn wrote:
>> On 20/08/2022 09.01, Chris Angelico wrote:
>>> On Sat, 20 Aug 2022 at 05:12, Barry wrote:
> On 19 Aug 2022, at 19:33, Chris Angelico wrote:
>
> What's the best way to precisely reconstruct an
On 2022-08-20, Stefan Ram wrote:
> Jon Ribbens writes:
>>... or you could avoid all that faff and just do re.sub()?
>
> import bs4
> import re
>
> source = ''
>
> # Use Python to change the source, keeping the order of attributes.
>
> result = re.sub( r'href\s*=\s*"http"', r'href="https"',
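For reference, a minimal sketch of the re.sub() approach, built on the anchor-tag pattern quoted elsewhere in this thread with its `OLD` placeholder filled in via `re.escape`. The helper name `replace_href` and the sample markup are assumptions made for illustration. As the thread points out, a pattern like this can fail on some inputs, so treat it as a demonstration that attribute order and whitespace survive, not as a general HTML rewriter.

```python
import re


def replace_href(source: str, old: str, new: str) -> str:
    """Swap one exact href value on <a> tags, leaving all other text as-is."""
    pattern = re.compile(
        r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*"""
        + re.escape(old)
        + r"""\s*\2"""
    )
    # A function replacement avoids backslash processing of `new`
    # and reuses whichever quote character the source used.
    return pattern.sub(
        lambda m: m.group(1) + m.group(2) + new + m.group(2), source
    )


# Hypothetical input with attributes in a non-alphabetical order.
source = '<a id="link1"  href="http://example.com/elsie" class="sister">Elsie</a>'
result = replace_href(
    source, "http://example.com/elsie", "https://example.com/elsie"
)
print(result)
```

Only the matched href value changes. The double space and the id/href/class ordering in the input are carried through untouched.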
On 2022-08-20, Chris Angelico wrote:
> On Sun, 21 Aug 2022 at 03:27, Stefan Ram wrote:
>> 2qdxy4rzwzuui...@potatochowder.com writes:
>> >textual representations. That way, the following two elements are the
>> >same (and similar with a collection of sub-elements in a different order
>> >in
On Sun, 21 Aug 2022 at 03:27, Stefan Ram wrote:
>
> 2qdxy4rzwzuui...@potatochowder.com writes:
> >textual representations. That way, the following two elements are the
> >same (and similar with a collection of sub-elements in a different order
> >in another document):
>
> The /elements/
On 2022-08-19, Chris Angelico wrote:
> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
>
> Using the Alice example from the BS4 docs:
>
html_doc = """The Dormouse's story
>
>The Dormouse's story
>
>Once upon a time there were three little
On Sat, 20 Aug 2022 at 10:19, dn wrote:
>
> On 20/08/2022 09.01, Chris Angelico wrote:
> > On Sat, 20 Aug 2022 at 05:12, Barry wrote:
> >>
> >>
> >>
> >>> On 19 Aug 2022, at 19:33, Chris Angelico wrote:
> >>>
> >>> What's the best way to precisely reconstruct an HTML file after
> >>> parsing
On 20/08/2022 09.01, Chris Angelico wrote:
> On Sat, 20 Aug 2022 at 05:12, Barry wrote:
>>
>>
>>
>>> On 19 Aug 2022, at 19:33, Chris Angelico wrote:
>>>
>>> What's the best way to precisely reconstruct an HTML file after
>>> parsing it with BeautifulSoup?
>>
>> I recall that in bs4 it parses
On Sat, 20 Aug 2022 at 10:04, David wrote:
>
> On Sat, 20 Aug 2022 at 04:31, Chris Angelico wrote:
>
> > What's the best way to precisely reconstruct an HTML file after
> > parsing it with BeautifulSoup?
>
> > Note two distinct changes: firstly, whitespace has been removed, and
> > secondly,
On Sat, 20 Aug 2022 at 04:31, Chris Angelico wrote:
> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
> Note two distinct changes: firstly, whitespace has been removed, and
> secondly, attributes are reordered (I think alphabetically). There are
On Sat, 20 Aug 2022 at 05:12, Barry wrote:
>
>
>
> > On 19 Aug 2022, at 19:33, Chris Angelico wrote:
> >
> > What's the best way to precisely reconstruct an HTML file after
> > parsing it with BeautifulSoup?
>
> I recall that in bs4 it parses into an object tree and loses the detail of
> the
On 2022-08-19 at 20:12:35 +0100,
Barry wrote:
> > On 19 Aug 2022, at 19:33, Chris Angelico wrote:
> >
> > What's the best way to precisely reconstruct an HTML file after
> > parsing it with BeautifulSoup?
>
> I recall that in bs4 it parses into an object tree and loses the
> detail of the
> On 19 Aug 2022, at 19:33, Chris Angelico wrote:
>
> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
I recall that in bs4 it parses into an object tree and loses the detail of the
input.
I recently ported from very old bs to bs4 and hit the
What's the best way to precisely reconstruct an HTML file after
parsing it with BeautifulSoup?
Using the Alice example from the BS4 docs:
>>> html_doc = """The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and
their names were
http://example.com/elsie;