Comment Holder <commenthol...@gmail.com> writes:

> Hi,
> I am totally new to Python. I noticed that there are many videos showing how 
> to collect data from Python, but I am not sure if I would be able to 
> accomplish my goal using Python so I can start learning.
>
> Here is the example of the target page:
> http://and.medianewsonline.com/hello.html
> In this example, there are 10 articles.
>
> What I exactly need is to do the following:
> 1- Collect the article title, date, source, and contents.
> 2- I need to be able to export the final results to excel or a database 
> client. That is, I need to have all of those specified in step 1 in one row, 
> while each of them saved in separate column. For example:
>
> Title1    Date1   Source1   Contents1
> Title2    Date2   Source2   Contents2
>
> I appreciate any advice regarding my case.
>
> Thanks & Regards//

Here is an attempt for you. It uses BeautifulSoup 4 and is written for Python
3.3, so if you want to use Python 2.x you will have to make some small changes,
like
from urllib import urlopen
and probably something with the print statements.
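
If you want one script that runs under both versions, a common pattern is to
try the Python 3 import first and fall back to the Python 2 one. This is only
a sketch of the import part; I have not tested the rest of the script under
Python 2.

```python
# Pick the right urlopen for the running interpreter.
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2
```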

The formatting in columns is left as an exercise for you. I wonder how you
would want that to look when the contents span multiple paragraphs.

from bs4 import BeautifulSoup
from urllib.request import urlopen

URL = "http://and.medianewsonline.com/hello.html"
html = urlopen(URL).read()
# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, 'html.parser')
arts = soup.find_all('div', class_='articleHeader')

for art in arts:
    # The first child of the header div holds the article name.
    name = art.contents[0].string.strip()
    print(name)
    # The article body is the sibling div that follows the header.
    artbody = art.find_next_sibling('div', class_='article enArticle')
    titlenode = artbody.find_next('div', id='hd')
    title = titlenode.get_text().strip()
    print("Title: {0}".format(title))
    # Skip author links; the first other link after the title is the source.
    srcnode = titlenode.find_next('a')
    while srcnode.parent.get('class') == ['author']:
        srcnode = srcnode.find_next('a')
    source = srcnode.string
    # The date is in the div just before the one containing the source link.
    srcnode = srcnode.parent
    date = srcnode.find_previous_sibling('div').string
    print("Date: {0}".format(date))
    print("Source: {0}".format(source))
    # Join all following article paragraphs into one string.
    cont = srcnode.find_next_siblings(
        'p', class_='articleParagraph enarticleParagraph')
    contents = '\n'.join([c.get_text() for c in cont])
    print("Contents: {0}".format(contents))
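
As a starting point for the column exercise, here is a minimal sketch that
writes one row per article to CSV, which Excel can open. The function name
articles_to_csv, the field names, and the sample rows are all made up for
illustration; in the real script you would collect a dict per article inside
the loop instead. The csv module quotes fields that contain newlines, so
multi-paragraph contents stay in a single cell.

```python
import csv
import io

# Hypothetical column names, one per item requested in the question.
FIELDS = ["title", "date", "source", "contents"]

def articles_to_csv(articles, out):
    """Write a header row and one row per article dict to the file-like `out`."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for art in articles:
        writer.writerow(art)

# Made-up sample data standing in for the scraped values.
sample = [
    {"title": "Title1", "date": "Date1", "source": "Source1",
     "contents": "First paragraph.\nSecond paragraph."},
    {"title": "Title2", "date": "Date2", "source": "Source2",
     "contents": "Only paragraph."},
]

# Write to an in-memory buffer here; newline="" lets csv control line endings.
buf = io.StringIO(newline="")
articles_to_csv(sample, buf)
csv_text = buf.getvalue()
```

To write a real file under Python 3, pass
open("articles.csv", "w", newline="", encoding="utf-8") as the second argument
instead of the buffer.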

    
-- 
Piet van Oostrum <p...@vanoostrum.org>
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]