Re: Mutating an HTML file with BeautifulSoup

2022-08-23 Thread Peter J. Holzer
On 2022-08-22 19:27:28 -, Jon Ribbens via Python-list wrote: > On 2022-08-22, Peter J. Holzer wrote: > > On 2022-08-22 00:45:56 -, Jon Ribbens via Python-list wrote: > >> With the offset though, BeautifulSoup made an arbitrary decision to > >> use ISO-8859

Re: Mutating an HTML file with BeautifulSoup

2022-08-22 Thread Jon Ribbens via Python-list
On 2022-08-22, Peter J. Holzer wrote: > On 2022-08-22 00:45:56 -, Jon Ribbens via Python-list wrote: >> With the offset though, BeautifulSoup made an arbitrary decision to >> use ISO-8859-1 encoding and so when you chopped the bytestring at >> that offset it only worked

Re: Mutating an HTML file with BeautifulSoup

2022-08-22 Thread Peter J. Holzer
On 2022-08-22 00:45:56 -, Jon Ribbens via Python-list wrote: > With the offset though, BeautifulSoup made an arbitrary decision to > use ISO-8859-1 encoding and so when you chopped the bytestring at > that offset it only worked because BeautifulSoup had happened to > choose

Re: Mutating an HTML file with BeautifulSoup

2022-08-22 Thread Peter J. Holzer
On 2022-08-22 00:09:01 -, Jon Ribbens via Python-list wrote: > On 2022-08-21, Peter J. Holzer wrote: > > On 2022-08-20 21:51:41 -, Jon Ribbens via Python-list wrote: > >> result = re.sub( > >> r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""", > > > > This will fail on: > > >

Re: Mutating an HTML file with BeautifulSoup

2022-08-22 Thread Jon Ribbens via Python-list
> > >> >> > That's the bit that's hard to prove. >> >> > >> >> >> The OP could edit the files with regexps to create a new version. >> >> > >> >> > To you and Jon, who also suggested this: how would that be ben

Re: Mutating an HTML file with BeautifulSoup

2022-08-22 Thread Jon Ribbens via Python-list
On 2022-08-21, Peter J. Holzer wrote: > On 2022-08-20 21:51:41 -, Jon Ribbens via Python-list wrote: >> On 2022-08-20, Stefan Ram wrote: >> > Jon Ribbens writes: >> >>... or you could avoid all that faff and just do re.sub()? > >> > source = '' >> > >> > # Use Python to change the source, ke

Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Peter Otten
t that attributes are reordered: >>> import bs4 >>> soup = bs4.BeautifulSoup("""""") >>> soup >>> class Formatter(bs4.formatter.HTMLFormatter): def attributes(self, tag): return [] if tag.attrs is None else list(tag.attrs.items(

Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Chris Angelico
On Mon, 22 Aug 2022 at 10:04, Buck Evan wrote: > > I've had much success doing round trips through the lxml.html parser. > > https://lxml.de/lxmlhtml.html > > I ditched bs for lxml long ago and never regretted it. > > If you find that you have a bunch of invalid html that lxml inadvertently > "fi

Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Buck Evan
> What's the best way to precisely reconstruct an HTML file after > parsing it with BeautifulSoup? > > Using the Alice example from the BS4 docs: > > >>> html_doc = """The Dormouse's story > > The Dormouse's story > > Once upon a t

Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Chris Angelico
ould edit the files with regexps to create a new version. > >> > > >> > To you and Jon, who also suggested this: how would that be beneficial? > >> > With Beautiful Soup, I have the line number and position within the > >> > line where the tag start

Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Jon Ribbens via Python-list
With Beautiful Soup, I have the line number and position within the >> > line where the tag starts; what does a regex give me that I don't have >> > that way? >> >> You mean you could use BeautifulSoup to read the file and identify the >> bits you want to chang

Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Peter J. Holzer
On 2022-08-20 21:51:41 -, Jon Ribbens via Python-list wrote: > On 2022-08-20, Stefan Ram wrote: > > Jon Ribbens writes: > >>... or you could avoid all that faff and just do re.sub()? > > source = '' > > > > # Use Python to change the source, keeping the order of attributes. > > > > result =

Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Barry
>>>> >>>> >>>>>> On 19 Aug 2022, at 19:33, Chris Angelico wrote: >>>>> >>>>> What's the best way to precisely reconstruct an HTML file after >>>>> parsing it with BeautifulSoup? >>>> >

Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Chris Angelico
> >>> > >>> What's the best way to precisely reconstruct an HTML file after > >>> parsing it with BeautifulSoup? > >> > >> I recall that in bs4 it parses into an object tree and loses the detail of > >> the input. > >> I

Re: Mutating an HTML file with BeautifulSoup

2022-08-21 Thread Barry
> On 19 Aug 2022, at 22:04, Chris Angelico wrote: > > On Sat, 20 Aug 2022 at 05:12, Barry wrote: >> >> >> >>>> On 19 Aug 2022, at 19:33, Chris Angelico wrote: >>> >>> What's the best way to precisely reconstruct an HTML file

Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Chris Angelico
On Sun, 21 Aug 2022 at 13:41, dn wrote: > > On 21/08/2022 13.00, Chris Angelico wrote: > > Well, I don't like headaches, but I do appreciate what the G&S Archive > > has given me over the years, so I'm taking this on as a means of > > giving back to the community. > > This point will be picked-up

Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread dn
On 21/08/2022 13.00, Chris Angelico wrote: > On Sun, 21 Aug 2022 at 09:48, dn wrote: >> On 20/08/2022 12.38, Chris Angelico wrote: >>> On Sat, 20 Aug 2022 at 10:19, dn wrote: On 20/08/2022 09.01, Chris Angelico wrote: > On Sat, 20 Aug 2022 at 05:12, Barry wrote: >>> On 19 Aug 2022,

Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Chris Angelico
> On 19 Aug 2022, at 19:33, Chris Angelico wrote: > >>>>> > >>>>> What's the best way to precisely reconstruct an HTML file after > >>>>> parsing it with BeautifulSoup? > ... > > >>> well. Thanks for trying, anyhow. > >>

Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Chris Angelico
. > > > >> The OP could edit the files with regexps to create a new version. > > > > To you and Jon, who also suggested this: how would that be beneficial? > > With Beautiful Soup, I have the line number and position within the > > line where the tag star

Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread dn
>>>> What's the best way to precisely reconstruct an HTML file after >>>>> parsing it with BeautifulSoup? ... >>> well. Thanks for trying, anyhow. >>> >>> So I'm left with a few options: >>> >>> 1) Give up on valid

Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Jon Ribbens via Python-list
= re.sub( r'href\s*=\s*"http"', r'href="https"', source ) > result = re.sub( r"href\s*=\s*'http'", r"href='https'", result ) You could go a bit harder with the regexp of course, e.g.: result = re.sub( r""
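A minimal sketch of the regexp approach discussed in this thread, using only the standard library. The pattern is the one quoted elsewhere in the thread; the sample input and the replacement value `NEW` are assumptions made for illustration. The thread itself points out inputs on which this pattern fails, so treat it as a demonstration of the backreference idea rather than a robust HTML rewriter.

```python
import re

# Hypothetical sample input; OLD and NEW stand in for real URLs.
source = '<a class="x" href = "OLD">one</a> <a href=\'OLD\'>two</a> <p>OLD</p>'

# Group 1 captures the tag text up to the opening quote, group 2 captures the
# quote character itself, and \2 requires the same character to close the value.
pattern = re.compile(r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""")

# \g<1> re-emits the tag prefix unchanged, so attribute order and spacing
# survive; only the quoted href value is swapped.
result = pattern.sub(r"\g<1>\g<2>NEW\g<2>", source)
print(result)
```

Because everything outside the matched value is copied through verbatim, whitespace and attribute order are preserved, which is the property the original question was after. The `OLD` inside the `<p>` element is left alone because the pattern only matches inside an `<a ... href=...>` tag.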

Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Jon Ribbens via Python-list
suggested this: how would that be beneficial? > With Beautiful Soup, I have the line number and position within the > line where the tag starts; what does a regex give me that I don't have > that way? You mean you could use BeautifulSoup to read the file and identify the bits you wan

Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Chris Angelico
and position within the line where the tag starts; what does a regex give me that I don't have that way? > Soup := BeautifulSoup. > > Then have Soup read both the new version and the old version. > > Then have Soup also edit the old version read in, the same way as

Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Jon Ribbens via Python-list
On 2022-08-19, Chris Angelico wrote: > What's the best way to precisely reconstruct an HTML file after > parsing it with BeautifulSoup? > > Using the Alice example from the BS4 docs: > >>>> html_doc = """The Dormouse's story > >The Do

Re: Mutating an HTML file with BeautifulSoup

2022-08-19 Thread Chris Angelico
at's the best way to precisely reconstruct an HTML file after > >>> parsing it with BeautifulSoup? > >> > >> I recall that in bs4 it parses into an object tree and loses the detail of > >> the input. > >> I recently ported from very old bs to bs4

Re: Mutating an HTML file with BeautifulSoup

2022-08-19 Thread dn
On 20/08/2022 09.01, Chris Angelico wrote: > On Sat, 20 Aug 2022 at 05:12, Barry wrote: >> >> >> >>> On 19 Aug 2022, at 19:33, Chris Angelico wrote: >>> >>> What's the best way to precisely reconstruct an HTML file after >>> parsi

Re: Mutating an HTML file with BeautifulSoup

2022-08-19 Thread Chris Angelico
On Sat, 20 Aug 2022 at 10:04, David wrote: > > On Sat, 20 Aug 2022 at 04:31, Chris Angelico wrote: > > > What's the best way to precisely reconstruct an HTML file after > > parsing it with BeautifulSoup? > > > Note two distinct changes: firstly, whitespace

Re: Mutating an HTML file with BeautifulSoup

2022-08-19 Thread David
On Sat, 20 Aug 2022 at 04:31, Chris Angelico wrote: > What's the best way to precisely reconstruct an HTML file after > parsing it with BeautifulSoup? > Note two distinct changes: firstly, whitespace has been removed, and > secondly, attributes are reordered (I think alphabeti

Re: Mutating an HTML file with BeautifulSoup

2022-08-19 Thread Chris Angelico
On Sat, 20 Aug 2022 at 05:12, Barry wrote: > > > > > On 19 Aug 2022, at 19:33, Chris Angelico wrote: > > > > What's the best way to precisely reconstruct an HTML file after > > parsing it with BeautifulSoup? > > I recall that in bs4 it parses into a

Re: Mutating an HTML file with BeautifulSoup

2022-08-19 Thread 2QdxY4RzWzUUiLuE
On 2022-08-19 at 20:12:35 +0100, Barry wrote: > > On 19 Aug 2022, at 19:33, Chris Angelico wrote: > > > > What's the best way to precisely reconstruct an HTML file after > > parsing it with BeautifulSoup? > > I recall that in bs4 it parses into an object

Re: Mutating an HTML file with BeautifulSoup

2022-08-19 Thread Barry
> On 19 Aug 2022, at 19:33, Chris Angelico wrote: > > What's the best way to precisely reconstruct an HTML file after > parsing it with BeautifulSoup? I recall that in bs4 it parses into an object tree and loses the detail of the input. I recently ported from very old bs t

Mutating an HTML file with BeautifulSoup

2022-08-19 Thread Chris Angelico
What's the best way to precisely reconstruct an HTML file after parsing it with BeautifulSoup? Using the Alice example from the BS4 docs: >>> html_doc = """The Dormouse's story The Dormouse's story Once upon a time there were three little sisters; and th

Re: how to obtain the text for BeautifulSoup object

2018-03-20 Thread Peter Otten
Nathan Zhu wrote: > Hi Team, > > could anyone help me? > > for webpage having source code like this: > ... > > number > name > > > I only can use below sentence, since there are a lot of tag em and tag a > in other area. > output = >

how to obtain the text for BeautifulSoup object

2018-03-19 Thread Nathan Zhu
Hi Team, could anyone help me? for webpage having source code like this: ... number name I only can use below sentence, since there are a lot of tag em and tag a in other area. output = bs4.BeautifulSoup(res.content,'lxml').findAll("span",{"class":"

Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Christopher Reimer via Python-list
Ah, shoot me. I had a .join() statement on the output queue but not on the input queue. So the threads for the input queue got terminated before BeautifulSoup could get started. I went down that same rabbit hole with CSVWriter the other day. *sigh* Thanks for everyone's help. Ch
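The fix described above can be illustrated with a small standard-library sketch. The poster's actual code is not shown in this listing, so the worker function, the worker count, and the stand-in "parsing" step below are all assumptions; the point is only that the main thread has to wait on the input queue as well.

```python
import queue
import threading

NUM_WORKERS = 4          # hypothetical worker count
pages = queue.Queue()    # input queue: raw pages waiting to be parsed
results = queue.Queue()  # output queue: parsed results

def parse_worker():
    while True:
        page = pages.get()
        try:
            # Stand-in for the real parsing step (e.g. building a soup).
            results.put(page.upper())
        finally:
            # Each get() needs a matching task_done(), or join() never returns.
            pages.task_done()

for _ in range(NUM_WORKERS):
    threading.Thread(target=parse_worker, daemon=True).start()

for n in range(10):
    pages.put(f"page-{n}")

# Block until every queued page has been processed. Without this call the
# main thread can finish first and the daemon workers die mid-task.
pages.join()

parsed = sorted(results.get() for _ in range(results.qsize()))
print(parsed)
```

Daemon threads are used here so the program can exit once the queue is drained; with non-daemon workers a sentinel value per worker would be needed to stop them.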

Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Peter Otten
ead_threads requesting and putting pages into the output > queue that is the input_queue for the parser. My soup_threads can get > items from the queue, but BeautifulSoup doesn't do anything after that. > > Chris R. -- https://mail.python.org/mailman/listinfo/python-list

Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Paul Rubin
Christopher Reimer writes: > I have 20 read_threads requesting and putting pages into the output > queue that is the input_queue for the parser. Given how slow parsing is, you probably want to scrape the pages into disk files, and then run the parser in parallel processes that read from the disk.

Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Christopher Reimer via Python-list
On 8/27/2017 1:50 PM, MRAB wrote: What if you don't sort the list? I ask because it sounds like you're changing 2 variables (i.e. list->queue, sorted->unsorted) at the same time, so you can't be sure that it's the queue that's the problem. If I'm using a list, I'm using a for loop to input ite

Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Christopher Reimer via Python-list
20 read_threads requesting and putting pages into the output queue that is the input_queue for the parser. My soup_threads can get items from the queue, but BeautifulSoup doesn't do anything after that. Chris R.

Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread MRAB
On 2017-08-27 21:35, Christopher Reimer via Python-list wrote: On 8/27/2017 1:12 PM, MRAB wrote: What do you mean by "queue (random order)"? A queue is sequential order, first-in-first-out. With 20 threads requesting 20 different pages, they're not going into the queue in sequential order (i

Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Christopher Reimer via Python-list
On 8/27/2017 1:12 PM, MRAB wrote: What do you mean by "queue (random order)"? A queue is sequential order, first-in-first-out. With 20 threads requesting 20 different pages, they're not going into the queue in sequential order (i.e., 0, 1, 2, ..., 17, 18, 19) and coming in at different time

Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Peter Otten
Christopher Reimer via Python-list wrote: > On 8/27/2017 11:54 AM, Peter Otten wrote: > >> The documentation >> >> https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup >> >> says you can make the BeautifulSoup object from a string or file

Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread MRAB
On 2017-08-27 20:35, Christopher Reimer via Python-list wrote: On 8/27/2017 11:54 AM, Peter Otten wrote: The documentation https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup says you can make the BeautifulSoup object from a string or file. Can you give a few more details

Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Christopher Reimer via Python-list
On 8/27/2017 11:54 AM, Peter Otten wrote: The documentation https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup says you can make the BeautifulSoup object from a string or file. Can you give a few more details where the queue comes into play? A small code sample would be

Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Peter Otten
le thread) > > It takes 15 minutes to process ~11,000 comments. > > When I replaced the list with a queue between the Requestor and Parser > to speed up things, BeautifulSoup stopped working. > > When I changed BeautifulSoup(contents, "lxml") to > BeautifulSoup(conte

BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Christopher Reimer via Python-list
d the list with a queue between the Requestor and Parser to speed up things, BeautifulSoup stopped working. When I changed BeautifulSoup(contents, "lxml") to BeautifulSoup(contents), I get the UserWarning that no parser was explicitly set and a reference to line 80 in threading

Re: Python BeautifulSoup extract html table cells that contains images and text

2017-07-29 Thread Piet van Oostrum
Umar Yusuf writes: > Hi all, > > I need help extracting the table from this url...? > > from bs4 import BeautifulSoup > url = "https://www.marinetraffic.com/en/ais/index/ports/all/per_page:50"; > > headers = {'User-agent': 'Mozilla/5.0

Python BeautifulSoup extract html table cells that contains images and text

2017-07-25 Thread Umar Yusuf
Hi all, I need help extracting the table from this url...? from bs4 import BeautifulSoup url = "https://www.marinetraffic.com/en/ais/index/ports/all/per_page:50"; headers = {'User-agent': 'Mozilla/5.0'} raw_html = requests.get(url, headers=headers) raw_

Re: Delete h2 until you reach the next h2 in beautifulsoup

2016-11-06 Thread rosefox911
On Sunday, November 6, 2016 at 1:27:48 AM UTC-4, rosef...@gmail.com wrote: > Considering the following html: > > cool stuff hiid="cool"> zz > > and the following list: > > ignore_list = ['example','lalala'] > > My goal

Delete h2 until you reach the next h2 in beautifulsoup

2016-11-05 Thread rosefox911
Considering the following html: cool stuff hizz and the following list: ignore_list = ['example','lalala'] My goal is, while going through the HTML using Beautifulsoup, I find a h2 that has an ID that is in my list (ignore_list) I should delete all the

Re: BeautifulSoup help !!

2016-10-09 Thread alister
On Fri, 07 Oct 2016 03:12:32 +1100, Steve D'Aprano wrote: > On Fri, 7 Oct 2016 02:30 am, alister wrote: > >> On Thu, 06 Oct 2016 08:22:05 -0700, desolate.soul.me wrote: >> >>> So I've just started up with python and an assignment was given to me >>> by a company as an recruitment task. >>> >> so

Re: BeautifulSoup help !!

2016-10-07 Thread Navneet Siddhant
i was able to almost get the results i require but only half of the info is showing up. i.e suppose there are 20 coupons showing on the page but when i print them using the container they are in only 5 are printed on the screen. Also figured out how to write to csv file , but even here only 1 ro

Re: BeautifulSoup help !!

2016-10-07 Thread Pierre-Alain Dorange
again I dont need complete code , a > resource where I could find more info about using Beautifulsoup will be > appreciated. Also do I need some kind of plugin etc to extract data to > csv ? or it is built in python and I could simply import csv and write > other commands needed ?? Beaut

Re: BeautifulSoup help !!

2016-10-06 Thread Michael Torrie
h it. Let me clarify once again I dont need > complete code , a resource where I could find more info about using > Beautifulsoup will be appreciated. Also do I need some kind of > plugin etc to extract data to csv ? or it is built in python and I > could simply import csv and write other com

Re: BeautifulSoup help !!

2016-10-06 Thread Navneet Siddhant
where I could find more info about using Beautifulsoup will be appreciated. Also do I need some kind of plugin etc to extract data to csv ? or it is built in python and I could simply import csv and write other commands needed ??

Re: BeautifulSoup help !!

2016-10-06 Thread Chris Angelico
On Fri, Oct 7, 2016 at 4:00 AM, Navneet Siddhant wrote: > I guess I shouldnt have mentioned as this was a recruitment task. If needed I > can post a screenshot of the mail I got which says I can take help from > anywhere possible as long as the assignment is done. Wont be simply copying > pasti

Re: BeautifulSoup help !!

2016-10-06 Thread Chris Angelico
On Fri, Oct 7, 2016 at 3:38 AM, Steve D'Aprano wrote: > On Fri, 7 Oct 2016 03:00 am, Chris Angelico wrote: > >> You are asking >> for assistance with something that was assigned to you *as a >> recruitment task*. Were you told that asking for help was a legitimate >> solution? > > Why should he ne

Re: BeautifulSoup help !!

2016-10-06 Thread Navneet Siddhant
onraja.in and export it to csv format. > The details which I need to be present in the csv are the coupon title , > vendor , validity , description/detail , url to the vendor , image url of the > coupon. > > I have gone through many tutorials on beautifulsoup and have a beginners

Re: BeautifulSoup help !!

2016-10-06 Thread Steve D'Aprano
On Fri, 7 Oct 2016 03:00 am, Chris Angelico wrote: > You are asking > for assistance with something that was assigned to you *as a > recruitment task*. Were you told that asking for help was a legitimate > solution? Why should he need to be told that? Asking for help *is* a legitimate solution, j

Re: BeautifulSoup help !!

2016-10-06 Thread Navneet Siddhant
On Thursday, October 6, 2016 at 9:57:46 PM UTC+5:30, Navneet Siddhant wrote: > On Thursday, October 6, 2016 at 9:42:47 PM UTC+5:30, Steve D'Aprano wrote: > > On Fri, 7 Oct 2016 02:30 am, alister wrote: > > > > > On Thu, 06 Oct 2016 08:22:05 -0700, desolate.soul.me wrote: > > > > > >> So I've just

Re: BeautifulSoup help !!

2016-10-06 Thread Navneet Siddhant
On Thursday, October 6, 2016 at 9:42:47 PM UTC+5:30, Steve D'Aprano wrote: > On Fri, 7 Oct 2016 02:30 am, alister wrote: > > > On Thu, 06 Oct 2016 08:22:05 -0700, desolate.soul.me wrote: > > > >> So I've just started up with python and an assignment was given to me by > >> a company as an recruit

Re: BeautifulSoup help !!

2016-10-06 Thread Noah
+1 at Steve On 6 Oct 2016 19:17, "Steve D'Aprano" wrote: > On Fri, 7 Oct 2016 02:30 am, alister wrote: > > > On Thu, 06 Oct 2016 08:22:05 -0700, desolate.soul.me wrote: > > > >> So I've just started up with python and an assignment was given to me by > >> a company as an recruitment task. >

Re: BeautifulSoup help !!

2016-10-06 Thread Steve D'Aprano
On Fri, 7 Oct 2016 02:30 am, alister wrote: > On Thu, 06 Oct 2016 08:22:05 -0700, desolate.soul.me wrote: > >> So I've just started up with python and an assignment was given to me by >> a company as an recruitment task. >> > so by your own admission you have just started with python yet you > co

Re: BeautifulSoup help !!

2016-10-06 Thread alister
On Thu, 06 Oct 2016 08:50:25 -0700, Navneet Siddhant wrote: > On Thursday, October 6, 2016 at 9:00:21 PM UTC+5:30, alister wrote: >> On Thu, 06 Oct 2016 08:22:05 -0700, desolate.soul.me wrote: >> >> > So I've just started up with python and an assignment was given to me >> > by a company as an re

Re: BeautifulSoup help !!

2016-10-06 Thread Chris Angelico
On Fri, Oct 7, 2016 at 2:50 AM, Navneet Siddhant wrote: > On Thursday, October 6, 2016 at 9:00:21 PM UTC+5:30, alister wrote: >> On Thu, 06 Oct 2016 08:22:05 -0700, desolate.soul.me wrote: >> >> > So I've just started up with python and an assignment was given to me by >> > a company as an recruit

Re: BeautifulSoup help !!

2016-10-06 Thread Navneet Siddhant
On Thursday, October 6, 2016 at 9:00:21 PM UTC+5:30, alister wrote: > On Thu, 06 Oct 2016 08:22:05 -0700, desolate.soul.me wrote: > > > So I've just started up with python and an assignment was given to me by > > a company as an recruitment task. > > > so by your own admission you have just starte

Re: BeautifulSoup help !!

2016-10-06 Thread alister
On Thu, 06 Oct 2016 08:22:05 -0700, desolate.soul.me wrote: > So I've just started up with python and an assignment was given to me by > a company as an recruitment task. > so by your own admission you have just started with python yet you consider your self suitable for employment? -- "Unibus

BeautifulSoup help !!

2016-10-06 Thread desolate . soul . me
oupon title , vendor , validity , description/detail , url to the vendor , image url of the coupon. I have gone through many tutorials on beautifulsoup and have a beginners understanding of using it. Wrote a code as well , but the problem Im facing here is when i collect info from the divs which con

Re: Question about how to do something in BeautifulSoup?

2016-01-23 Thread Cody Piersall
es it's defined > as #00 > - Sometimes the is within the and sometimes the is > within the . > - There may be other discrepancies I haven't noticed yet > > How can I do this in BeautifulSoup (or is this better done in lxml.html)? I hope this helps you get started:

Re: Question about how to do something in BeautifulSoup?

2016-01-22 Thread Mario R. Osorio
I think you'd do better using the pyparsing library On Friday, January 22, 2016 at 9:02:00 AM UTC-5, inhahe wrote: > I hope this is an appropriate mailing list for BeautifulSoup questions, > it's been a long time since I've used python-list and I don't remember if

Re: Question about how to do something in BeautifulSoup?

2016-01-22 Thread Peter Otten
inhahe wrote: > I hope this is an appropriate mailing list for BeautifulSoup questions, > it's been a long time since I've used python-list and I don't remember if > third-party modules are on topic. I did try posting to the BeautifulSoup > mailing list on Google group

Question about how to do something in BeautifulSoup?

2016-01-22 Thread inhahe
I hope this is an appropriate mailing list for BeautifulSoup questions, it's been a long time since I've used python-list and I don't remember if third-party modules are on topic. I did try posting to the BeautifulSoup mailing list on Google groups, but I've waited a day o

Re: beautifulSoup 4.1

2015-04-04 Thread Joe Farro
Could use zip:

    tds = iter(soup('td'))
    for abbr, defn in zip(tds, tds):
        print abbr.get_text(), defn.get_text()
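The pairing trick in the post above does not depend on BeautifulSoup, so it can be tried with a plain list. In this sketch the list of cell texts is a hypothetical stand-in for `soup('td')`; only the first pair comes from the thread.

```python
# Hypothetical cell texts, standing in for soup('td') in the post above.
cells = ["App", "Approaching", "Bmp", "Bumped", "Chk", "Checked"]

it = iter(cells)
# zip pulls from the same iterator twice per step, so it yields
# (cells[0], cells[1]), (cells[2], cells[3]), and so on.
pairs = list(zip(it, it))

for abbr, meaning in pairs:
    print(abbr, meaning)
```

One thing to keep in mind: if the sequence has an odd number of items, `zip` silently drops the last one, so this only suits tables where cells reliably come in pairs.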

Workaround for BeautifulSoup/HTML5parser bug

2015-03-21 Thread John Nagle
BeautifulSoup 4 and HTML5parser are known to not play well together. I have a workaround for that. See https://bugs.launchpad.net/beautifulsoup/+bug/1430633 This isn't a fix; it's a postprocessor to fix broken BS4 trees. This is for use until the BS4 maintainers f

Re: beautifulSoup 4.1

2015-03-20 Thread Sayth
Thanks. I couldn't get that second text out. You can use the simpler css class selector I used before in bs4 after 4.1. The longer version was used to overcome class clashing with the reserved keyword in previous versions.

Re: beautifulSoup 4.1

2015-03-20 Thread Denis McMahon
On Fri, 20 Mar 2015 00:18:33 -0700, Sayth Renshaw wrote: > Just finding it odd that the next sibling is a "\n" and not the next > otherwise that would be the perfect solution. Whitespace between elements creates a node in the parsed document. This is correct, because whitespace between elements

Re: beautifulSoup 4.1

2015-03-20 Thread Denis McMahon
On Fri, 20 Mar 2015 07:23:22 +, Denis McMahon wrote: > print td.get_text(), td.find_next_sibling().get_text() A slightly better solution might even be: print td.get_text(), td.find_next_sibling("td").get_text() -- Denis McMahon, denismfmcma...@gmail.com -- https://mail.python.org/mail

Re: beautifulSoup 4.1

2015-03-20 Thread Denis McMahon
On Thu, 19 Mar 2015 21:20:30 -0700, Sayth Renshaw wrote: > But how can I get the value of the following td # find all tds with a class attribute of "abbreviation" abbtds = soup.find_all("td", attrs={"class": "abbreviation"}) # display the text of each abbtd with the text of the next td for td

Re: beautifulSoup 4.1

2015-03-20 Thread Sayth Renshaw
J awk > k-up > > But how can I get the value of the following td. That is for > > class="abbreviation">App I would get Approaching > > So when creating a csv I could use > > print App Approaching > > __ > Abbr | Meaning | > ___

beautifulSoup 4.1

2015-03-19 Thread Sayth Renshaw
izz with soup yet reading here http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class Thanks Sayth

Re: beautifulsoup VS lxml

2014-12-12 Thread iMath
On Friday, December 12, 2014 at 10:19:56 AM UTC+8, Michael Torrie wrote: > On 12/11/2014 07:02 PM, iMath wrote: > > > > which is more easy and elegant for pulling data out of HTML? > > Beautiful Soup is specialized for HTML parsing, and it can deal with > badly formed HTML, but if I recall

Re: beautifulsoup VS lxml

2014-12-11 Thread Michael Torrie
On 12/11/2014 07:02 PM, iMath wrote: > > which is more easy and elegant for pulling data out of HTML? Beautiful Soup is specialized for HTML parsing, and it can deal with badly formed HTML, but if I recall correctly BeautifulSoup can use the lxml engine under the hood, so maybe it's

beautifulsoup VS lxml

2014-12-11 Thread iMath
Which is easier and more elegant for pulling data out of HTML?

Re: Need Help with the BeautifulSoup problem, please

2013-12-16 Thread Peter Otten
seasp...@gmail.com wrote: > I need to replace all tag with after ■. But the result from > below is '■ D / ' > Can you explain what I did wrong, please. > > s = '■A B C D / ' > soup = BeautifulSoup(s) > for i in soup.find_all(text

Re: Need Help with the BeautifulSoup problem, please

2013-12-16 Thread seaspeak
> Can you explain what I did wrong, please. > > > > > > > > > > > > s = '■A B C D / ' > > > > > > soup = BeautifulSoup(s) > > > > > > for i in soup.find_all(text='■'): > >

Re: Need Help with the BeautifulSoup problem, please

2013-12-16 Thread Andreas Perstinger
On 16.12.2013 07:41, seasp...@gmail.com wrote: I need to replace all tag with after ■. But the result from below is '■ D / ' Can you explain what I did wrong, please. s = '■A B C D / ' soup = BeautifulSoup(s) for i in soup.find_all(text='■'):

Re: Need Help with the BeautifulSoup problem, please

2013-12-16 Thread seaspeak
On Monday, December 16, 2013 at 2:41:08 PM UTC+8, seas...@gmail.com wrote: > I need to replace all tag with after ■. But the result from below > is '■ D / ' > > Can you explain what I did wrong, please. > > > > s = '■A B C D / ' > > soup = Bea

Re: Need Help with the BeautifulSoup problem, please

2013-12-16 Thread 88888 Dihedral
On Monday, December 16, 2013 2:41:08 PM UTC+8, seas...@gmail.com wrote: > I need to replace all tag with after ■. But the result from below > is '■ D / ' > > Can you explain what I did wrong, please. > > > > s = '■A B C D / ' >

Need Help with the BeautifulSoup problem, please

2013-12-15 Thread seaspeak
I need to replace all tag with after ■. But the result from below is '■ D / ' Can you explain what I did wrong, please. s = '■A B C D / ' soup = BeautifulSoup(s) for i in soup.find_all(text='■'): tag = soup.new_tag('span')

Re: how to extract page-URL using BeautifulSoup

2013-11-01 Thread Joel Goldstick
700, bhaktanishant wrote: > >> I want to extract the page-url. for example: >> if i have this code >> >> import urllib2 from bs4 import BeautifulSoup link = >> "http://www.google.com"; >> page = urllib2.urlopen(link).read() >> soup = BeautifulSoup(pa

Re: how to extract page-URL using BeautifulSoup

2013-11-01 Thread Alister
On Thu, 31 Oct 2013 08:59:00 -0700, bhaktanishant wrote: > I want to extract the page-url. for example: > if i have this code > > import urllib2 from bs4 import BeautifulSoup link = > "http://www.google.com"; > page = urllib2.urlopen(link).read() > soup = Beau

Re: how to extract page-URL using BeautifulSoup

2013-10-31 Thread MRAB
On 31/10/2013 15:59, bhaktanish...@gmail.com wrote: I want to extract the page-url. for example: if i have this code import urllib2 from bs4 import BeautifulSoup link = "http://www.google.com"; page = urllib2.urlopen(link).read() soup = BeautifulSoup(page) then i can extract title

how to extract page-URL using BeautifulSoup

2013-10-31 Thread bhaktanishant
I want to extract the page-url. for example: if i have this code import urllib2 from bs4 import BeautifulSoup link = "http://www.google.com"; page = urllib2.urlopen(link).read() soup = BeautifulSoup(page) then i can extract title of page by: title = soup.title but i want to know t

Re: Trying to write beautifulsoup result to a file and get error message

2011-11-14 Thread Andreas Perstinger
On 2011-11-13 23:37, goldtech wrote: If I try: ... soup = BeautifulSoup(ft3) f = open(r'c:\NewFolder\clean4.html', "w") f.write(soup) f.close() I get error message: Traceback (most recent call last): File "C:\Documents and Settings\user01\Desktop\py\tb1a.py",

Re: Trying to write beautifulsoup result to a file and get error message

2011-11-13 Thread MRAB
On 13/11/2011 22:37, goldtech wrote: If I try: ... soup = BeautifulSoup(ft3) f = open(r'c:\NewFolder\clean4.html', "w") f.write(soup) f.close() I get error message: Traceback (most recent call last): File "C:\Documents and Settings\user01\Desktop\py\tb1a.py",

Trying to write beautifulsoup result to a file and get error message

2011-11-13 Thread goldtech
If I try: ... soup = BeautifulSoup(ft3) f = open(r'c:\NewFolder\clean4.html', "w") f.write(soup) f.close() I get error message: Traceback (most recent call last): File "C:\Documents and Settings\user01\Desktop\py\tb1a.py", line 203, in f.write(soup) TypeErro

Re: encoding problem with BeautifulSoup - problem when writing parsed text to file

2011-10-08 Thread Nobody
"r").read() > fileContent = fileObj.decode("iso-8859-2") > fileSoup = BeautifulSoup(fileContent) The fileObj.decode() step should be unnecessary, and is usually undesirable; Beautiful Soup should be doing the decoding itself. If you actually know the encoding (e.g. from

Re: encoding problem with BeautifulSoup - problem when writing parsed text to file

2011-10-06 Thread John Gordon
In xDog Walker writes: > What is this io of which you speak? It was introduced in Python 2.6. -- John Gordon A is for Amy, who fell down the stairs gor...@panix.com B is for Basil, assaulted by bears -- Edward Gorey, "The Gashl

Re: encoding problem with BeautifulSoup - problem when writing parsed text to file

2011-10-06 Thread xDog Walker
On Thursday 2011 October 06 10:41, jmfauth wrote: > or  (Python2/Python3) > > >>> import io > >>> with io.open('abc.txt', 'r', encoding='iso-8859-2') as f: > > ...     r = f.read() > ... > > >>> repr(r) > > u'a\nb\nc\n' > > >>> with io.open('def.txt', 'w', encoding='utf-8-sig') as f: > > ...     t
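The `io.open` approach quoted above can be written out as a complete round trip. This is a sketch under assumptions: the sample text and the temporary directory are invented for illustration, while the file names and the two encodings follow the quoted snippet. On Python 3 the built-in `open` accepts the same `encoding` argument.

```python
import io
import os
import tempfile

# Sample text containing characters that exist in ISO-8859-2 (assumption).
text = "zażółć gęślą jaźń\n"

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "abc.txt")
dst = os.path.join(workdir, "def.txt")

# Create an ISO-8859-2 encoded input file to work with.
with io.open(src, "w", encoding="iso-8859-2") as f:
    f.write(text)

# Decode on read: the file object hands back text, not bytes.
with io.open(src, "r", encoding="iso-8859-2") as f:
    decoded = f.read()

# Encode on write: utf-8-sig prepends a byte order mark to the output.
with io.open(dst, "w", encoding="utf-8-sig") as f:
    f.write(decoded)

# Read the raw bytes back to see what actually landed on disk.
with io.open(dst, "rb") as f:
    raw = f.read()

print(repr(decoded))
print(raw[:3])
```

Letting the file object do the decoding and encoding keeps the program working with text throughout, so no manual `.decode()` or `.encode()` calls are needed between reading and writing.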

Re: encoding problem with BeautifulSoup - problem when writing parsed text to file

2011-10-06 Thread jmfauth
eObj = open(filePath,"r").read() > fileContent = fileObj.decode("iso-8859-2") > fileSoup = BeautifulSoup(fileContent) > > ## Do some BeautifulSoup magic and preserve unicode, presume result is > saved in 'text' ## > > ## write extracted text to file
