On 2022-08-22 19:27:28 -, Jon Ribbens via Python-list wrote:
> On 2022-08-22, Peter J. Holzer wrote:
> > On 2022-08-22 00:45:56 -, Jon Ribbens via Python-list wrote:
> >> With the offset though, BeautifulSoup made an arbitrary decision to
> >> use ISO-8859-1 encoding and so when you chopped the bytestring at
> >> that offset it only worked because BeautifulSoup had happened to
> >> choose
On 2022-08-22 00:09:01 -, Jon Ribbens via Python-list wrote:
> On 2022-08-21, Peter J. Holzer wrote:
> > On 2022-08-20 21:51:41 -, Jon Ribbens via Python-list wrote:
> >> result = re.sub(
> >> r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""",
> >
> > This will fail on:
> >
>
> >
>> >> > That's the bit that's hard to prove.
>> >> >
>> >> >> The OP could edit the files with regexps to create a new version.
>> >> >
>> >> > To you and Jon, who also suggested this: how would that be beneficial?
On 2022-08-21, Peter J. Holzer wrote:
> On 2022-08-20 21:51:41 -, Jon Ribbens via Python-list wrote:
>> On 2022-08-20, Stefan Ram wrote:
>> > Jon Ribbens writes:
>> >>... or you could avoid all that faff and just do re.sub()?
>
>> > source = ''
>> >
>> > # Use Python to change the source, keeping the order of attributes.
Note that attributes are reordered:
>>> import bs4
>>> soup = bs4.BeautifulSoup("""""")
>>> soup
>>> class Formatter(bs4.formatter.HTMLFormatter):
...     def attributes(self, tag):
...         return [] if tag.attrs is None else list(tag.attrs.items())
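A runnable sketch of that idea (the markup and class name here are made up,
and it assumes bs4 >= 4.8, which provides bs4.formatter): passing the
formatter at serialization time keeps the attribute order that was parsed.

import bs4

class UnsortedAttributes(bs4.formatter.HTMLFormatter):
    def attributes(self, tag):
        # the default formatter sorts attributes alphabetically; keep parse order
        return [] if tag.attrs is None else list(tag.attrs.items())

soup = bs4.BeautifulSoup('<a target="_blank" href="x">y</a>', 'html.parser')
print(soup.decode(formatter=UnsortedAttributes()))
# -> <a target="_blank" href="x">y</a>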
On Mon, 22 Aug 2022 at 10:04, Buck Evan wrote:
>
> I've had much success doing round trips through the lxml.html parser.
>
> https://lxml.de/lxmlhtml.html
>
> I ditched bs for lxml long ago and never regretted it.
>
> If you find that you have a bunch of invalid html that lxml inadvertently
> "fi
> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
>
> Using the Alice example from the BS4 docs:
>
> >>> html_doc = """The Dormouse's story
>
> The Dormouse's story
>
> Once upon a t
The OP could edit the files with regexps to create a new version.
> >> >
> >> > To you and Jon, who also suggested this: how would that be beneficial?
> >> > With Beautiful Soup, I have the line number and position within the
> >> > line where the tag start
With Beautiful Soup, I have the line number and position within the
>> > line where the tag starts; what does a regex give me that I don't have
>> > that way?
>>
>> You mean you could use BeautifulSoup to read the file and identify the
>> bits you want to chang
On 2022-08-20 21:51:41 -, Jon Ribbens via Python-list wrote:
> On 2022-08-20, Stefan Ram wrote:
> > Jon Ribbens writes:
> >>... or you could avoid all that faff and just do re.sub()?
> > source = ''
> >
> > # Use Python to change the source, keeping the order of attributes.
> >
> > result =
>>>>
>>>>
>>>>>> On 19 Aug 2022, at 19:33, Chris Angelico wrote:
>>>>>
>>>>> What's the best way to precisely reconstruct an HTML file after
>>>>> parsing it with BeautifulSoup?
>>>>
> >>>
> >>> What's the best way to precisely reconstruct an HTML file after
> >>> parsing it with BeautifulSoup?
> >>
> >> I recall that in bs4 it parses into an object tree and loses the detail of
> >> the input.
> >> I
> On 19 Aug 2022, at 22:04, Chris Angelico wrote:
>
> On Sat, 20 Aug 2022 at 05:12, Barry wrote:
>>
>>
>>
>>>> On 19 Aug 2022, at 19:33, Chris Angelico wrote:
>>>
>>> What's the best way to precisely reconstruct an HTML file
On Sun, 21 Aug 2022 at 13:41, dn wrote:
>
> On 21/08/2022 13.00, Chris Angelico wrote:
> > Well, I don't like headaches, but I do appreciate what the G&S Archive
> > has given me over the years, so I'm taking this on as a means of
> > giving back to the community.
>
> This point will be picked-up
On 21/08/2022 13.00, Chris Angelico wrote:
> On Sun, 21 Aug 2022 at 09:48, dn wrote:
>> On 20/08/2022 12.38, Chris Angelico wrote:
>>> On Sat, 20 Aug 2022 at 10:19, dn wrote:
On 20/08/2022 09.01, Chris Angelico wrote:
> On Sat, 20 Aug 2022 at 05:12, Barry wrote:
>>> On 19 Aug 2022,
> On 19 Aug 2022, at 19:33, Chris Angelico wrote:
> >>>>>
> >>>>> What's the best way to precisely reconstruct an HTML file after
> >>>>> parsing it with BeautifulSoup?
> ...
>
> >>> well. Thanks for trying, anyhow.
> >>&g
.
> >
> >> The OP could edit the files with regexps to create a new version.
> >
> > To you and Jon, who also suggested this: how would that be beneficial?
> > With Beautiful Soup, I have the line number and position within the
> > line where the tag starts
>>>>> What's the best way to precisely reconstruct an HTML file after
>>>>> parsing it with BeautifulSoup?
...
>>> well. Thanks for trying, anyhow.
>>>
>>> So I'm left with a few options:
>>>
>>> 1) Give up on valid
= re.sub( r'href\s*=\s*"http"', r'href="https"', source )
> result = re.sub( r"href\s*=\s*'http'", r"href='https'", result )
You could go a bit harder with the regexp of course, e.g.:
result = re.sub(
r""
suggested this: how would that be beneficial?
> With Beautiful Soup, I have the line number and position within the
> line where the tag starts; what does a regex give me that I don't have
> that way?
You mean you could use BeautifulSoup to read the file and identify the
bits you wan
and position within the
line where the tag starts; what does a regex give me that I don't have
that way?
> Soup := BeautifulSoup.
>
> Then have Soup read both the new version and the old version.
>
> Then have Soup also edit the old version read in, the same way as
On 2022-08-19, Chris Angelico wrote:
> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
>
> Using the Alice example from the BS4 docs:
>
>>>> html_doc = """The Dormouse's story
>
>The Do
at's the best way to precisely reconstruct an HTML file after
> >>> parsing it with BeautifulSoup?
> >>
> >> I recall that in bs4 it parses into an object tree and loses the detail of
> >> the input.
> >> I recently ported from very old bs to bs4
On 20/08/2022 09.01, Chris Angelico wrote:
> On Sat, 20 Aug 2022 at 05:12, Barry wrote:
>>
>>
>>
>>> On 19 Aug 2022, at 19:33, Chris Angelico wrote:
>>>
>>> What's the best way to precisely reconstruct an HTML file after
>>> parsi
On Sat, 20 Aug 2022 at 10:04, David wrote:
>
> On Sat, 20 Aug 2022 at 04:31, Chris Angelico wrote:
>
> > What's the best way to precisely reconstruct an HTML file after
> > parsing it with BeautifulSoup?
>
> > Note two distinct changes: firstly, whitespace
On Sat, 20 Aug 2022 at 04:31, Chris Angelico wrote:
> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
> Note two distinct changes: firstly, whitespace has been removed, and
> secondly, attributes are reordered (I think alphabetically)
On Sat, 20 Aug 2022 at 05:12, Barry wrote:
>
>
>
> > On 19 Aug 2022, at 19:33, Chris Angelico wrote:
> >
> > What's the best way to precisely reconstruct an HTML file after
> > parsing it with BeautifulSoup?
>
> I recall that in bs4 it parses into a
On 2022-08-19 at 20:12:35 +0100,
Barry wrote:
> > On 19 Aug 2022, at 19:33, Chris Angelico wrote:
> >
> > What's the best way to precisely reconstruct an HTML file after
> > parsing it with BeautifulSoup?
>
> I recall that in bs4 it parses into an object
> On 19 Aug 2022, at 19:33, Chris Angelico wrote:
>
> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
I recall that in bs4 it parses into an object tree and loses the detail of the
input.
I recently ported from very old bs to bs4
What's the best way to precisely reconstruct an HTML file after
parsing it with BeautifulSoup?
Using the Alice example from the BS4 docs:
>>> html_doc = """The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and
th
Nathan Zhu wrote:
> Hi Team,
>
> could anyone help me?
>
> for webpage having source code like this:
> ...
>
> number
> name
>
>
> I only can use below sentence, since there are a lot of tag em and tag a
> in other area.
> output =
>
Hi Team,
could anyone help me?
for webpage having source code like this:
...
number
name
I only can use below sentence, since there are a lot of tag em and tag a in
other area.
output = bs4.BeautifulSoup(res.content,'lxml').findAll("span",{"class":"
Ah, shoot me. I had a .join() statement on the output queue but not on
the input queue. So the threads for the input queue got terminated
before BeautifulSoup could get started. I went down that same rabbit
hole with CSVWriter the other day. *sigh*
Thanks for everyone's help.
Chris R.
> I have 20 read_threads requesting and putting pages into the output
> queue that is the input_queue for the parser. My soup_threads can get
> items from the queue, but BeautifulSoup doesn't do anything after that.
>
> Chris R.
Christopher Reimer writes:
> I have 20 read_threads requesting and putting pages into the output
> queue that is the input_queue for the parser.
Given how slow parsing is, you probably want to scrape the pages into
disk files, and then run the parser in parallel processes that read from
the disk.
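A rough sketch of that split (the file names and layout are made up): the
download threads dump raw pages to disk, and a process pool does the slow
parsing afterwards.

import glob
from multiprocessing import Pool
from bs4 import BeautifulSoup

def parse_file(path):
    # parse one saved page and pull out whatever you need (title here)
    with open(path, 'rb') as f:
        soup = BeautifulSoup(f.read(), 'lxml')
    return path, soup.title.string if soup.title else None

if __name__ == '__main__':
    with Pool() as pool:
        for path, title in pool.map(parse_file, glob.glob('pages/*.html')):
            print(path, title)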
On 8/27/2017 1:50 PM, MRAB wrote:
What if you don't sort the list? I ask because it sounds like you're
changing 2 variables (i.e. list->queue, sorted->unsorted) at the same
time, so you can't be sure that it's the queue that's the problem.
If I'm using a list, I'm using a for loop to input ite
20 read_threads requesting and putting pages into the output
queue that is the input_queue for the parser. My soup_threads can get
items from the queue, but BeautifulSoup doesn't do anything after that.
Chris R.
On 2017-08-27 21:35, Christopher Reimer via Python-list wrote:
On 8/27/2017 1:12 PM, MRAB wrote:
What do you mean by "queue (random order)"? A queue is sequential
order, first-in-first-out.
With 20 threads requesting 20 different pages, they're not going into
the queue in sequential order (i
On 8/27/2017 1:12 PM, MRAB wrote:
What do you mean by "queue (random order)"? A queue is sequential
order, first-in-first-out.
With 20 threads requesting 20 different pages, they're not going into
the queue in sequential order (i.e., 0, 1, 2, ..., 17, 18, 19) and
coming in at different time
Christopher Reimer via Python-list wrote:
> On 8/27/2017 11:54 AM, Peter Otten wrote:
>
>> The documentation
>>
>> https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup
>>
>> says you can make the BeautifulSoup object from a string or file
On 2017-08-27 20:35, Christopher Reimer via Python-list wrote:
On 8/27/2017 11:54 AM, Peter Otten wrote:
The documentation
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup
says you can make the BeautifulSoup object from a string or file.
Can you give a few more details
On 8/27/2017 11:54 AM, Peter Otten wrote:
The documentation
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup
says you can make the BeautifulSoup object from a string or file.
Can you give a few more details where the queue comes into play? A small
code sample would be
le thread)
>
> It takes 15 minutes to process ~11,000 comments.
>
> When I replaced the list with a queue between the Requestor and Parser
> to speed up things, BeautifulSoup stopped working.
>
> When I changed BeautifulSoup(contents, "lxml") to
> BeautifulSoup(conte
d the list with a queue between the Requestor and Parser
to speed up things, BeautifulSoup stopped working.
When I changed BeautifulSoup(contents, "lxml") to
BeautifulSoup(contents), I get the UserWarning that no parser was
explicitly set and a reference to line 80 in threading
Umar Yusuf writes:
> Hi all,
>
> I need help extracting the table from this url...?
>
> from bs4 import BeautifulSoup
> url = "https://www.marinetraffic.com/en/ais/index/ports/all/per_page:50";
>
> headers = {'User-agent': 'Mozilla/5.0
Hi all,
I need help extracting the table from this url...?
from bs4 import BeautifulSoup
url = "https://www.marinetraffic.com/en/ais/index/ports/all/per_page:50";
headers = {'User-agent': 'Mozilla/5.0'}
raw_html = requests.get(url, headers=headers)
raw_
On Sunday, November 6, 2016 at 1:27:48 AM UTC-4, rosef...@gmail.com wrote:
> Considering the following html:
>
> cool stuff hiid="cool"> zz
>
> and the following list:
>
> ignore_list = ['example','lalala']
>
> My goal
Considering the following html:
cool stuff hizz
and the following list:
ignore_list = ['example','lalala']
My goal is, while going through the HTML using Beautifulsoup, I find a h2 that
has an ID that is in my list (ignore_list) I should delete all the
On Fri, 07 Oct 2016 03:12:32 +1100, Steve D'Aprano wrote:
> On Fri, 7 Oct 2016 02:30 am, alister wrote:
>
>> On Thu, 06 Oct 2016 08:22:05 -0700, desolate.soul.me wrote:
>>
>>> So I've just started up with python and an assignment was given to me
>>> by a company as an recruitment task.
>>>
>> so
i was able to almost get the results i require but only half of the info is
showing up. i.e suppose there are 20 coupons showing on the page but when i
print them using the container they are in only 5 are printed on the screen.
Also figured out how to write to csv file , but even here only 1 ro
again I dont need complete code , a
> resource where I could find more info about using Beautifulsoup will be
> appreciated. Also do I need some kind of plugin etc to extract data to
> csv ? or it is built in python and I could simply import csv and write
> other commands needed ??
Beaut
h it. Let me clarify once again I dont need
> complete code , a resource where I could find more info about using
> Beautifulsoup will be appreciated. Also do I need some kind of
> plugin etc to extract data to csv ? or it is built in python and I
> could simply import csv and write other com
where I could
find more info about using Beautifulsoup will be appreciated. Also do I need
some kind of plugin etc to extract data to csv ? or it is built in python and I
could simply import csv and write other commands needed ??
On Fri, Oct 7, 2016 at 4:00 AM, Navneet Siddhant
wrote:
> I guess I shouldnt have mentioned as this was a recruitment task. If needed I
> can post a screenshot of the mail I got which says I can take help from
> anywhere possible as long as the assignment is done. Wont be simply copying
> pasti
On Fri, Oct 7, 2016 at 3:38 AM, Steve D'Aprano
wrote:
> On Fri, 7 Oct 2016 03:00 am, Chris Angelico wrote:
>
>> You are asking
>> for assistance with something that was assigned to you *as a
>> recruitment task*. Were you told that asking for help was a legitimate
>> solution?
>
> Why should he ne
onraja.in and export it to csv format.
> The details which I need to be present in the csv are the coupon title ,
> vendor , validity , description/detail , url to the vendor , image url of the
> coupon.
>
> I have gone through many tutorials on beautifulsoup and have a beginners
On Fri, 7 Oct 2016 03:00 am, Chris Angelico wrote:
> You are asking
> for assistance with something that was assigned to you *as a
> recruitment task*. Were you told that asking for help was a legitimate
> solution?
Why should he need to be told that? Asking for help *is* a legitimate
solution, j
On Thursday, October 6, 2016 at 9:57:46 PM UTC+5:30, Navneet Siddhant wrote:
> On Thursday, October 6, 2016 at 9:42:47 PM UTC+5:30, Steve D'Aprano wrote:
> > On Fri, 7 Oct 2016 02:30 am, alister wrote:
> >
> > > On Thu, 06 Oct 2016 08:22:05 -0700, desolate.soul.me wrote:
> > >
> > >> So I've just
On Thursday, October 6, 2016 at 9:42:47 PM UTC+5:30, Steve D'Aprano wrote:
> On Fri, 7 Oct 2016 02:30 am, alister wrote:
>
> > On Thu, 06 Oct 2016 08:22:05 -0700, desolate.soul.me wrote:
> >
> >> So I've just started up with python and an assignment was given to me by
> >> a company as an recruit
+1 at Steve
On 6 Oct 2016 19:17, "Steve D'Aprano" wrote:
> On Fri, 7 Oct 2016 02:30 am, alister wrote:
>
> > On Thu, 06 Oct 2016 08:22:05 -0700, desolate.soul.me wrote:
> >
> >> So I've just started up with python and an assignment was given to me by
> >> a company as an recruitment task.
>
On Fri, 7 Oct 2016 02:30 am, alister wrote:
> On Thu, 06 Oct 2016 08:22:05 -0700, desolate.soul.me wrote:
>
>> So I've just started up with python and an assignment was given to me by
>> a company as an recruitment task.
>>
> so by your own admission you have just started with python yet you
> co
On Thu, 06 Oct 2016 08:50:25 -0700, Navneet Siddhant wrote:
> On Thursday, October 6, 2016 at 9:00:21 PM UTC+5:30, alister wrote:
>> On Thu, 06 Oct 2016 08:22:05 -0700, desolate.soul.me wrote:
>>
>> > So I've just started up with python and an assignment was given to me
>> > by a company as an re
On Fri, Oct 7, 2016 at 2:50 AM, Navneet Siddhant
wrote:
> On Thursday, October 6, 2016 at 9:00:21 PM UTC+5:30, alister wrote:
>> On Thu, 06 Oct 2016 08:22:05 -0700, desolate.soul.me wrote:
>>
>> > So I've just started up with python and an assignment was given to me by
>> > a company as an recruit
On Thursday, October 6, 2016 at 9:00:21 PM UTC+5:30, alister wrote:
> On Thu, 06 Oct 2016 08:22:05 -0700, desolate.soul.me wrote:
>
> > So I've just started up with python and an assignment was given to me by
> > a company as an recruitment task.
> >
> so by your own admission you have just starte
On Thu, 06 Oct 2016 08:22:05 -0700, desolate.soul.me wrote:
> So I've just started up with python and an assignment was given to me by
> a company as an recruitment task.
>
so by your own admission you have just started with python yet you
consider yourself suitable for employment?
coupon title , vendor
, validity , description/detail , url to the vendor , image url of the coupon.
I have gone through many tutorials on beautifulsoup and have a beginners
understanding of using it. Wrote a code as well , but the problem Im facing
here is when i collect info from the divs which con
es it's
defined
> as #00
> - Sometimes the is within the and sometimes the is
> within the .
> - There may be other discrepancies I haven't noticed yet
>
> How can I do this in BeautifulSoup (or is this better done in lxml.html)?
I hope this helps you get started:
I think you'd do better using the pyparsing library
On Friday, January 22, 2016 at 9:02:00 AM UTC-5, inhahe wrote:
> I hope this is an appropriate mailing list for BeautifulSoup questions,
> it's been a long time since I've used python-list and I don't remember if
&g
inhahe wrote:
> I hope this is an appropriate mailing list for BeautifulSoup questions,
> it's been a long time since I've used python-list and I don't remember if
> third-party modules are on topic. I did try posting to the BeautifulSoup
> mailing list on Google group
I hope this is an appropriate mailing list for BeautifulSoup questions,
it's been a long time since I've used python-list and I don't remember if
third-party modules are on topic. I did try posting to the BeautifulSoup
mailing list on Google groups, but I've waited a day o
Could use zip:
tds = iter(soup('td'))
for abbr, defn in zip(tds, tds):
print abbr.get_text(), defn.get_text()
BeautifulSoup 4 and HTML5parser are known to not play well together.
I have a workaround for that. See
https://bugs.launchpad.net/beautifulsoup/+bug/1430633
This isn't a fix; it's a postprocessor to fix broken BS4 trees.
This is for use until the BS4 maintainers f
Thanks.
I couldn't get that second text out.
You can use the simpler css class selector I used before in bs4 after 4.1 .
The longer version was used to overcome class clashing with the reserved
keyword in previous versions.
On Fri, 20 Mar 2015 00:18:33 -0700, Sayth Renshaw wrote:
> Just finding it odd that the next sibling is a "\n" and not the next
> otherwise that would be the perfect solution.
Whitespace between elements creates a node in the parsed document. This
is correct, because whitespace between elements
On Fri, 20 Mar 2015 07:23:22 +, Denis McMahon wrote:
> print td.get_text(), td.find_next_sibling().get_text()
A slightly better solution might even be:
print td.get_text(), td.find_next_sibling("td").get_text()
On Thu, 19 Mar 2015 21:20:30 -0700, Sayth Renshaw wrote:
> But how can I get the value of the following td
# find all tds with a class attribute of "abbreviation"
abbtds = soup.find_all("td", attrs={"class": "abbreviation"})
# display the text of each abbtd with the text of the next td
for td in abbtds:
    print td.get_text(), td.find_next_sibling("td").get_text()
J awk
> k-up
>
> But how can I get the value of the following td. That is for
>
> class="abbreviation">App I would get Approaching
>
> So when creating a csv I could use
>
> print App Approaching
>
> ______________________
> | Abbr | Meaning      |
> |______|______________|
izz with soup yet reading
here
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
Thanks
Sayth
On Friday, December 12, 2014 at 10:19:56 AM UTC+8, Michael Torrie wrote:
> On 12/11/2014 07:02 PM, iMath wrote:
> >
> > which is more easy and elegant for pulling data out of HTML?
>
> Beautiful Soup is specialized for HTML parsing, and it can deal with
> badly formed HTML, but if I recall
On 12/11/2014 07:02 PM, iMath wrote:
>
> which is more easy and elegant for pulling data out of HTML?
Beautiful Soup is specialized for HTML parsing, and it can deal with
badly formed HTML, but if I recall correctly BeautifulSoup can use the
lxml engine under the hood, so maybe it's
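For reference, selecting that engine from bs4 is a one-argument change (this
assumes the lxml package is installed):

from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>hi</p>", "lxml")   # parse with lxml under the hood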
which is more easy and elegant for pulling data out of HTML?
seasp...@gmail.com wrote:
> I need to replace all tag with after ■. But the result from
> below is '■ D / '
> Can you explain what I did wrong, please.
>
> s = '■A B C D / '
> soup = BeautifulSoup(s)
> for i in soup.find_all(text='■'):
On 16.12.2013 07:41, seasp...@gmail.com wrote:
I need to replace all tag with after ■. But the result
from below is '■ D / '
Can you explain what I did wrong, please.
s = '■A B C D / '
soup = BeautifulSoup(s)
for i in soup.find_all(text='■'):
On Monday, December 16, 2013 at 2:41:08 PM UTC+8, seas...@gmail.com wrote:
> I need to replace all tag with after ■. But the result from below
> is '■ D / '
>
> Can you explain what I did wrong, please.
>
> s = '■A B C D / '
>
> soup = Bea
On Monday, December 16, 2013 2:41:08 PM UTC+8, seas...@gmail.com wrote:
> I need to replace all tag with after ■. But the result from below
> is '■ D / '
>
> Can you explain what I did wrong, please.
>
>
>
> s = '■A B C D / '
>
I need to replace all tag with after ■. But the result from
below is '■ D / '
Can you explain what I did wrong, please.
s = '■A B C D / '
soup = BeautifulSoup(s)
for i in soup.find_all(text='■'):
tag = soup.new_tag('span')
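The thread's markup was stripped by the archive, so the tags below are
stand-ins; this only sketches the new_tag()/replace_with() mechanics the loop
above appears to be reaching for, not the poster's actual intent:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>■ <i>A</i> <i>B</i></p>', 'html.parser')
for text in soup.find_all(text='■ '):
    span = soup.new_tag('span')
    span.string = str(text)
    text.replace_with(span)   # swap the matched text node for the new <span>
print(soup)   # <p><span>■ </span><i>A</i> <i>B</i></p>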
700, bhaktanishant wrote:
>
>> I want to extract the page-url. for example:
>> if i have this code
>>
>> import urllib2 from bs4 import BeautifulSoup link =
>> "http://www.google.com";
>> page = urllib2.urlopen(link).read()
>> soup = BeautifulSoup(pa
On Thu, 31 Oct 2013 08:59:00 -0700, bhaktanishant wrote:
> I want to extract the page-url. for example:
> if i have this code
>
> import urllib2 from bs4 import BeautifulSoup link =
> "http://www.google.com";
> page = urllib2.urlopen(link).read()
> soup = Beau
On 31/10/2013 15:59, bhaktanish...@gmail.com wrote:
I want to extract the page-url. for example:
if i have this code
import urllib2
from bs4 import BeautifulSoup
link = "http://www.google.com";
page = urllib2.urlopen(link).read()
soup = BeautifulSoup(page)
then i can extract title
I want to extract the page-url. for example:
if i have this code
import urllib2
from bs4 import BeautifulSoup
link = "http://www.google.com";
page = urllib2.urlopen(link).read()
soup = BeautifulSoup(page)
then i can extract title of page by:
title = soup.title
but i want to know t
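One way to get at it (an aside, not quoted from the thread's replies): the
parsed HTML does not carry the URL it came from, but the urllib2 response
object does, via geturl(), which also reflects any redirects.

import urllib2
from bs4 import BeautifulSoup

link = "http://www.google.com"
response = urllib2.urlopen(link)
soup = BeautifulSoup(response.read())
title = soup.title
page_url = response.geturl()   # the final URL the page was fetched from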
On 2011-11-13 23:37, goldtech wrote:
If I try:
...
soup = BeautifulSoup(ft3)
f = open(r'c:\NewFolder\clean4.html', "w")
f.write(soup)
f.close()
I get error message:
Traceback (most recent call last):
File "C:\Documents and Settings\user01\Desktop\py\tb1a.py",
If I try:
...
soup = BeautifulSoup(ft3)
f = open(r'c:\NewFolder\clean4.html', "w")
f.write(soup)
f.close()
I get error message:
Traceback (most recent call last):
File "C:\Documents and Settings\user01\Desktop\py\tb1a.py", line
203, in
f.write(soup)
TypeErro
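The replies above are truncated, so for the record, the usual fix:
file.write() expects a string, and soup is a BeautifulSoup object, so
serialize it first. A minimal sketch against the poster's own variables:

f = open(r'c:\NewFolder\clean4.html', "w")
f.write(str(soup))   # or f.write(soup.prettify())
f.close()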
> fileObj = open(filePath,"r").read()
> fileContent = fileObj.decode("iso-8859-2")
> fileSoup = BeautifulSoup(fileContent)
The fileObj.decode() step should be unnecessary, and is usually
undesirable; Beautiful Soup should be doing the decoding itself.
If you actually know the encoding (e.g. from
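A sketch of that advice (assuming bs4, and that the encoding really is known
up front): hand Beautiful Soup the raw bytes and pass the encoding as a hint
rather than decoding by hand. Beautiful Soup 3 spells the same hint
fromEncoding.

from bs4 import BeautifulSoup

filePath = "page.html"   # stand-in for the original poster's path
fileObj = open(filePath, "rb").read()
fileSoup = BeautifulSoup(fileObj, from_encoding="iso-8859-2")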
In xDog Walker
writes:
> What is this io of which you speak?
It was introduced in Python 2.6.
On Thursday 2011 October 06 10:41, jmfauth wrote:
> or (Python2/Python3)
>
> >>> import io
> >>> with io.open('abc.txt', 'r', encoding='iso-8859-2') as f:
> ...     r = f.read()
> ...
> >>> repr(r)
> u'a\nb\nc\n'
> >>> with io.open('def.txt', 'w', encoding='utf-8-sig') as f:
> ...     t
> fileObj = open(filePath,"r").read()
> fileContent = fileObj.decode("iso-8859-2")
> fileSoup = BeautifulSoup(fileContent)
>
> ## Do some BeautifulSoup magic and preserve unicode, presume result is
> saved in 'text' ##
>
> ## write extracted text to file