Well, Python has "Beautiful Soup".

http://www.crummy.com/software/BeautifulSoup/

"You didn't write that awful page. You're just trying to get some data out
of it. Right now, you don't really care what HTML is supposed to look like.

Neither does this parser."

In PHP I have used http://simplehtmldom.sourceforge.net/ to parse
badly formed HTML.

I wrote a script to import nodes using the latter and then saved them with
"node_save()".

An alternative could be to parse to CSV, then import using the node_export
or node_import modules.
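
For what it's worth, here is a rough sketch of the "parse to CSV" idea in
Python. It is only an illustration: it uses the standard library's
html.parser as a stand-in for Beautiful Soup, and the names TitleParser,
extract_title and pages_to_csv are made up for this example, not part of
any module mentioned in this thread. It pulls the page title out of sloppy
HTML and writes url,title rows that an import module could then pick up.

```python
import csv
import io
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; only the first <title> is used.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace and trim the ends.
        return " ".join("".join(self._chunks).split())


def extract_title(html):
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


def pages_to_csv(pages):
    """pages is an iterable of (url, html) pairs; returns CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["url", "title"])
    for url, html in pages:
        writer.writerow([url, extract_title(html)])
    return out.getvalue()
```

In practice you would fetch each page first and probably want more columns
(for example the provider), but the shape of the job stays the same.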

Hope that helps,

Victor Kane
http://awebfactory.com.ar
http://projectflowandtracker.com

On Wed, Dec 1, 2010 at 7:46 AM, Balazs Dianiska <[email protected]> wrote:

> Sadly, some of the older legacy sites are just not available as RSS; I
> had such a scraping request recently. I have to say that with
> drupal_http_request you don't even have to look at curl. You can do
> all sorts of things, even faking logins.
>
> To parse the HTML, use querypath. A trick that we use is to first run
> some sort of HTML tidy-up library on the downloaded page; otherwise
> querypath runs away crying. The beautify module can help you a great
> deal with that.
>
> Balazs
>
> On Wed, Dec 1, 2010 at 5:27 AM, Cameron Eagans <[email protected]> wrote:
> > Most of the time, you can get to the posts via RSS. Aggregator module
> > does a pretty good job of pulling stuff in, and the author of the
> > post that's displayed is whatever you tell it to display (see Drupal
> > Planet for an example).
> > Thanks,
> > Cameron
> >
> >
> >
> > On Tue, Nov 30, 2010 at 12:48, Kevin O <[email protected]> wrote:
> >>
> >> I second the recommendation of using QueryPath. I use it almost
> >> exclusively along with drupal_http_request, though I use curl in a
> >> few places (if you use curl, I recommend http://drupal.org/project/curl
> >> for a dependency check). I'd really recommend creating a custom module
> >> that uses the above and holds your filtering logic; I've done this
> >> for about a dozen modules now.
> >> That said, there are more modules available nowadays, such as
> >> http://drupal.org/project/feeds_xpathparser used with Feeds
> >> (http://drupal.org/project/feeds). There are about a dozen more
> >> modules that will accomplish the goal, though I haven't used them;
> >> I did go through and try most of the methods out for some recent
> >> projects.
> >> Cheers,
> >> Kevin O'Brien
> >> Drupal Developer
> >> http://www.coderintherye.com
> >> 415-754-0112
> >>
> >>
> >> On Tue, Nov 30, 2010 at 11:26 AM, <[email protected]> wrote:
> >>>
> >>> Today's Topics:
> >>>
> >>>   1. Drupal module for scraping information from an HTML/XML
> >>>      document (James Benstead)
> >>>   2. Re: Drupal module for scraping information from an HTML/XML
> >>>      document (John Fiala)
> >>>   3. Easter problem (Ámon Tamás)
> >>>   4. Re: Easter problem (Carl Wiedemann)
> >>>   5. Re: Easter problem ([email protected])
> >>>   6. Re: Easter problem ([email protected])
> >>>   7. Re: Easter problem ([email protected])
> >>>   8. Re: Easter problem (Jennifer Hodgdon)
> >>>
> >>>
> >>> ----------------------------------------------------------------------
> >>>
> >>> Message: 1
> >>> Date: Tue, 30 Nov 2010 18:56:09 +0000
> >>> From: James Benstead <[email protected]>
> >>> Subject: [development] Drupal module for scraping information from an
> >>>        HTML/XML document
> >>> To: development <[email protected]>
> >>> Message-ID: <[email protected]>
> >>> Content-Type: text/plain; charset="iso-8859-1"
> >>>
> >>> I've finally got round to doing some serious work on Drupalversity, an
> >>> open, web-based Drupal education project I've had in mind for a year
> >>> or so.
> >>>
> >>> People who use Drupalversity to learn have the option of adding
> >>> Resources to the site - i.e., links to posts at Lullabot, Chapter3 etc.
> >>> that explain how to do specific things with Drupal. A Resource is a
> >>> custom content type that includes a link to the resource and a text
> >>> field containing a description of that resource.
> >>>
> >>> What I'd like to do once a Resource has been added to the site is to
> >>> scrape certain information from it: at this point I'm thinking the
> >>> Title of the page the link points to and the provider of the resource
> >>> - e.g., which Drupal shop originally created the resource. What's the
> >>> best way to go about doing this? I'm pretty sure there's not a Drupal
> >>> module that solves the problem out of the box.
> >>>
> >>> So far I've considered:
> >>>
> >>>   - http://drupal.org/project/querypath
> >>>   - Drupal's built-in drupal_http_request() -
> >>>     http://api.drupal.org/api/drupal/includes--common.inc/function/drupal_http_request/6
> >>>   - curl
> >>>
> >>> Thanks,
> >>>
> >>> --Jim
> >>> --
> >>> My IM and Skype details are at http://state68.com/contact
> >>>
> >>> ------------------------------
> >>>
> >>> Message: 2
> >>> Date: Tue, 30 Nov 2010 12:06:33 -0700
> >>> From: John Fiala <[email protected]>
> >>> Subject: Re: [development] Drupal module for scraping information from
> >>>        an HTML/XML document
> >>> To: [email protected]
> >>> Message-ID:
> >>>        <[email protected]>
> >>> Content-Type: text/plain; charset=ISO-8859-1
> >>>
> >>> These days, if I'm going to be trying to extract data from html/xml,
> >>> I'd use querypath.  Give it a try!
> >>>
> >>>
> >>> --
> >>> John Fiala
> >>> www.jcfiala.net
> >>>
> >>>
> >>> ------------------------------
> >>>
> >>> Message: 3
> >>> Date: Tue, 30 Nov 2010 20:14:04 +0100
> >>> From: Ámon Tamás <[email protected]>
> >>> Subject: [development] Easter problem
> >>> To: [email protected]
> >>> Message-ID: <[email protected]>
> >>> Content-Type: text/plain; charset="utf-8"
> >>>
> >>> Hello,
> >>>
> >>> I have the nameday module (http://drupal.org/project/nameday) and I
> >>> got a feature request for Greek namedays. As I see it, they are based
> >>> on Easter, which is not an easy date to calculate.
> >>>
> >>> So I want to find an algorithm for Easter and similar days whose
> >>> results can be stored somehow. Maybe it should be a hook, or something
> >>> else that can be stored in the database.
> >>>
> >>>
> >>> Thanks
> >>>
> >>> --
> >>> Ámon Tamás
> >>> Sitefejlesztő és programozó
> >>>
> >>> ------------------------------
> >>>
> >>> Message: 4
> >>> Date: Tue, 30 Nov 2010 12:22:42 -0700
> >>> From: Carl Wiedemann <[email protected]>
> >>> Subject: Re: [development] Easter problem
> >>> To: [email protected]
> >>> Message-ID:
> >>>        <[email protected]>
> >>> Content-Type: text/plain; charset="iso-8859-2"
> >>>
> >>> Does this help? http://php.net/manual/en/function.easter-days.php
> >>>
> >>>
> >>> ------------------------------
> >>>
> >>> Message: 5
> >>> Date: Tue, 30 Nov 2010 13:24:07 -0600
> >>> From: "[email protected]" <[email protected]>
> >>> Subject: Re: [development] Easter problem
> >>> To: [email protected]
> >>> Message-ID: <[email protected]>
> >>> Content-Type: text/plain; charset=UTF-8; format=flowed
> >>>
> >>> There's no need for a hook here at all.  You can either code in the
> >>> algorithm for defining when Easter is (which sounds like it is in fact
> >>> rather complicated) or just pre-store known, pre-calculated dates for
> >>> it for the next decade or so.  (10 records, one per year; totally easy.)
> >>>
> >>> Both options are described here, including the different mechanisms for
> >>> defining when Easter is in different calendars:
> >>>
> >>> http://en.wikipedia.org/wiki/Easter#Date_of_Easter
> >>>
> >>> --Larry Garfield
> >>>
> >>>
> >>>
> >>> ------------------------------
> >>>
> >>> Message: 6
> >>> Date: Tue, 30 Nov 2010 14:23:56 -0500
> >>> From: [email protected]
> >>> Subject: Re: [development] Easter problem
> >>> To: [email protected]
> >>> Message-ID: <[email protected]>
> >>> Content-Type: text/plain; charset="utf-8"
> >>>
> >>> You can google it, but I believe this is one of those things that
> >>> cannot be reduced to an equation or algorithm. It's something like the
> >>> first Sunday after the first full moon after the spring equinox.
> >>>
> >>>
> >>> ------------------------------
> >>>
> >>> Message: 7
> >>> Date: Tue, 30 Nov 2010 13:26:23 -0600
> >>> From: "[email protected]" <[email protected]>
> >>> Subject: Re: [development] Easter problem
> >>> To: [email protected]
> >>> Message-ID: <[email protected]>
> >>> Content-Type: text/plain; charset=ISO-8859-2; format=flowed
> >>>
> >>> The Calendar PHP module is not enabled by default in a stock PHP, so I
> >>> don't know that you can rely on it (unfortunately).  It does have some
> >>> cool stuff in it, though.
> >>>
> >>> --Larry Garfield
> >>>
> >>> On 11/30/10 1:22 PM, Carl Wiedemann wrote:
> >>> > Does this help? http://php.net/manual/en/function.easter-days.php
> >>>
> >>>
> >>> ------------------------------
> >>>
> >>> Message: 8
> >>> Date: Tue, 30 Nov 2010 11:21:08 -0800
> >>> From: Jennifer Hodgdon <[email protected]>
> >>> Subject: Re: [development] Easter problem
> >>> To: [email protected]
> >>> Message-ID: <[email protected]>
> >>> Content-Type: text/plain; charset=UTF-8; format=flowed
> >>>
> >>> http://php.net/manual/en/function.easter-date.php
> >>>
> >>>
> >>> --
> >>> Jennifer Hodgdon * Poplar ProductivityWare
> >>> www.poplarware.com
> >>> Drupal web sites and custom Drupal modules
> >>>
> >>>
> >>>
> >>
> >
> >
>
