Sadly, some older legacy sites just don't offer RSS; I had a scraping request like that recently. I have to say that with drupal_http_request you don't even need to look at curl. It can do all sorts of things, even faking logins.
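
For the page-title case from the original question, here is a minimal sketch of the fetch-then-parse flow. It is written in Python only so that it runs on its own; inside a Drupal module the fetch would go through drupal_http_request() and the selection through QueryPath. The names TitleParser, extract_title, provider_from_url and scrape_resource are made up for illustration, and treating the link's hostname as the "provider" is an assumption, not something this thread settles.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <TITLE> matches too.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace and trim the ends.
        return " ".join("".join(self._chunks).split())


def extract_title(html):
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


def provider_from_url(url):
    """Assumption: use the hostname (minus 'www.') as the provider."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def scrape_resource(url, timeout=30):
    """Fetch a page and return its title and assumed provider."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return {"title": extract_title(html), "provider": provider_from_url(url)}
```

Python's html.parser is forgiving about unclosed tags, so this sketch has no separate tidy step; a stricter parser would need one before querying.
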
To parse the HTML, use QueryPath. A trick we use is to first run an HTML tidy library over the downloaded page, because QueryPath runs away crying when it meets malformed markup. The beautify module can help you a great deal with that.

Balazs

On Wed, Dec 1, 2010 at 5:27 AM, Cameron Eagans <[email protected]> wrote:
> Most of the time, you can get to the posts via RSS. Aggregator module does a
> pretty good job of pulling stuff in, and the author of the post that's
> displayed is whatever you tell it to display (see Drupal Planet for an
> example)
> Thanks,
> Cameron
>
>
>
> On Tue, Nov 30, 2010 at 12:48, Kevin O <[email protected]> wrote:
>>
>> I second the recommendation of using QueryPath. I use it almost
>> exclusively along with drupal_http_request, though I use curl only in a few
>> places (if you use curl I recommend http://drupal.org/project/curl for a
>> dependency check). I'd really recommend though creating a custom module that
>> uses the above and then has your logic for filtering in it, I've done this
>> for about a dozen modules now.
>> That said, there are some more modules available out there nowadays, such
>> as using http://drupal.org/project/feeds_xpathparser with feeds
>> http://drupal.org/project/feeds There are about a dozen more modules that
>> will accomplish the goal though I haven't used them, but I went through and
>> tried most of the methods out for some recent projects.
>> Cheers,
>> Kevin O'Brien
>> Drupal Developer
>> http://www.coderintherye.com
>> 415-754-0112
>>
>>
>> On Tue, Nov 30, 2010 at 11:26 AM, <[email protected]> wrote:
>>>
>>> Send development mailing list submissions to
>>>     [email protected]
>>>
>>> To subscribe or unsubscribe via the World Wide Web, visit
>>>     http://lists.drupal.org/mailman/listinfo/development
>>> or, via email, send a message with subject or body 'help' to
>>>     [email protected]
>>>
>>> You can reach the person managing the list at
>>>     [email protected]
>>>
>>> When replying, please edit your Subject line so it is more specific
>>> than "Re: Contents of development digest..."
>>>
>>>
>>> Today's Topics:
>>>
>>> 1. Drupal module for scraping information from an HTML/XML
>>>    document (James Benstead)
>>> 2. Re: Drupal module for scraping information from an HTML/XML
>>>    document (John Fiala)
>>> 3. Easter problem (Ámon Tamás)
>>> 4. Re: Easter problem (Carl Wiedemann)
>>> 5. Re: Easter problem ([email protected])
>>> 6. Re: Easter problem ([email protected])
>>> 7. Re: Easter problem ([email protected])
>>> 8. Re: Easter problem (Jennifer Hodgdon)
>>>
>>>
>>> ----------------------------------------------------------------------
>>>
>>> Message: 1
>>> Date: Tue, 30 Nov 2010 18:56:09 +0000
>>> From: James Benstead <[email protected]>
>>> Subject: [development] Drupal module for scraping information from an
>>>     HTML/XML document
>>> To: development <[email protected]>
>>> Message-ID:
>>>     <[email protected]>
>>> Content-Type: text/plain; charset="iso-8859-1"
>>>
>>> I've finally got round to doing some serious work on Drupalversity, an
>>> open, web-based Drupal education project I've had in mind for a year
>>> or so.
>>>
>>> People who use Drupalversity to learn have the option of adding
>>> Resources to the site - i.e., links to posts at Lullabot, Chapter3 etc
>>> that explain how to do specific things with Drupal. A Resource is a
>>> custom content type that includes a link to the resource and a text
>>> field containing a description of that resource.
>>>
>>> What I'd like to do once a Resource has been added to the site is to
>>> scrape certain information from it: at this point I'm thinking the
>>> Title of the page the link points to and the provider of the resource
>>> - e.g., which Drupal shop originally created the resource. What's the
>>> best way to go about doing this? I'm pretty sure there's not a Drupal
>>> module that solves the problem out of the box.
>>>
>>> So far I've considered:
>>>
>>> - http://drupal.org/project/querypath
>>> - Drupal's built-in drupal_http_request() -
>>>   http://api.drupal.org/api/drupal/includes--common.inc/function/drupal_http_request/6
>>> - curl
>>>
>>> Thanks,
>>>
>>> --Jim
>>> --
>>> My IM and Skype details are at http://state68.com/contact
>>> -------------- next part --------------
>>> An HTML attachment was scrubbed...
>>> URL:
>>> http://lists.drupal.org/pipermail/development/attachments/20101130/5600f1fe/attachment-0001.html
>>>
>>> ------------------------------
>>>
>>> Message: 2
>>> Date: Tue, 30 Nov 2010 12:06:33 -0700
>>> From: John Fiala <[email protected]>
>>> Subject: Re: [development] Drupal module for scraping information from
>>>     an HTML/XML document
>>> To: [email protected]
>>> Message-ID:
>>>     <[email protected]>
>>> Content-Type: text/plain; charset=ISO-8859-1
>>>
>>> These days, if I'm going to be trying to extract data from html/xml,
>>> I'd use querypath. Give it a try!
>>>
>>> On Tue, Nov 30, 2010 at 11:56 AM, James Benstead
>>> <[email protected]> wrote:
>>> > What I'd like to do once a Resource has been added to the site is to
>>> > scrape certain information from it: at this point I'm thinking the
>>> > Title of the page the link points to and the provider of the
>>> > resource - e.g., which Drupal shop originally created the resource.
>>> > What's the best way to go about doing this? I'm pretty sure there's
>>> > not a Drupal module that solves the problem out of the box.
>>>
>>> --
>>> John Fiala
>>> www.jcfiala.net
>>>
>>>
>>> ------------------------------
>>>
>>> Message: 3
>>> Date: Tue, 30 Nov 2010 20:14:04 +0100
>>> From: Ámon Tamás <[email protected]>
>>> Subject: [development] Easter problem
>>> To: [email protected]
>>> Message-ID:
>>>     <[email protected]>
>>> Content-Type: text/plain; charset="utf-8"
>>>
>>> Hello,
>>>
>>> I have the nameday module (http://drupal.org/project/nameday) and I
>>> get a feature request for the Greek namedays. How I see it is based on
>>> the Easter, what is not an easy thing to count.
>>>
>>> Well, I want to find some algorithm for Easter, and similar days, what
>>> is can be stored somehow. Maybe it should be a hook or some other
>>> think what can be stored in database.
>>>
>>>
>>> Thanks
>>>
>>> --
>>> Ámon Tamás
>>> Sitefejlesztő és programozó
>>> -------------- next part --------------
>>> An HTML attachment was scrubbed...
>>> URL:
>>> http://lists.drupal.org/pipermail/development/attachments/20101130/c81e61bf/attachment-0001.html
>>>
>>> ------------------------------
>>>
>>> Message: 4
>>> Date: Tue, 30 Nov 2010 12:22:42 -0700
>>> From: Carl Wiedemann <[email protected]>
>>> Subject: Re: [development] Easter problem
>>> To: [email protected]
>>> Message-ID:
>>>     <[email protected]>
>>> Content-Type: text/plain; charset="iso-8859-2"
>>>
>>> Does this help? http://php.net/manual/en/function.easter-days.php
>>>
>>> On Tue, Nov 30, 2010 at 12:14 PM, Ámon Tamás <[email protected]> wrote:
>>>
>>> > Hello,
>>> >
>>> > I have the nameday module (http://drupal.org/project/nameday) and I
>>> > get a feature request for the Greek namedays. How I see it is based
>>> > on the Easter, what is not an easy thing to count.
>>> >
>>> > Well, I want to find some algorithm for Easter, and similar days,
>>> > what is can be stored somehow. Maybe it should be a hook or some
>>> > other think what can be stored in database.
>>> >
>>> >
>>> > Thanks
>>> >
>>> > --
>>> > Ámon Tamás
>>> > Sitefejlesztő és programozó
>>> >
>>> >
>>> -------------- next part --------------
>>> An HTML attachment was scrubbed...
>>> URL:
>>> http://lists.drupal.org/pipermail/development/attachments/20101130/55b0fb8a/attachment-0001.html
>>>
>>> ------------------------------
>>>
>>> Message: 5
>>> Date: Tue, 30 Nov 2010 13:24:07 -0600
>>> From: "[email protected]" <[email protected]>
>>> Subject: Re: [development] Easter problem
>>> To: [email protected]
>>> Message-ID: <[email protected]>
>>> Content-Type: text/plain; charset=UTF-8; format=flowed
>>>
>>> There's no need for a hook here at all. You can either code in the
>>> algorithm for defining when Easter is (which sounds like it is in fact
>>> rather complicated) or just pre-store know pre-calculated dates for it
>>> for the next decade or so. (10 records, one per year; totally easy.)
>>>
>>> Both options are described here, including the different mechanisms for
>>> defining when Easter is in different calendars:
>>>
>>> http://en.wikipedia.org/wiki/Easter#Date_of_Easter
>>>
>>> --Larry Garfield
>>>
>>> On 11/30/10 1:14 PM, Ámon Tamás wrote:
>>> > Hello,
>>> >
>>> > I have the nameday module (http://drupal.org/project/nameday) and I
>>> > get a feature request for the Greek namedays. How I see it is based
>>> > on the Easter, what is not an easy thing to count.
>>> >
>>> > Well, I want to find some algorithm for Easter, and similar days,
>>> > what is can be stored somehow. Maybe it should be a hook or some
>>> > other think what can be stored in database.
>>> >
>>> >
>>> > Thanks
>>> >
>>> > --
>>> > Ámon Tamás
>>> > Sitefejlesztő és programozó
>>> >
>>>
>>>
>>> ------------------------------
>>>
>>> Message: 6
>>> Date: Tue, 30 Nov 2010 14:23:56 -0500
>>> From: [email protected]
>>> Subject: Re: [development] Easter problem
>>> To: [email protected]
>>> Message-ID: <[email protected]>
>>> Content-Type: text/plain; charset="utf-8"
>>>
>>> You can google it, but I believe this is one of those things that cannot
>>> be reduced to an equation or algorithm. It's something like the first
>>> Sunday after the first full moon after the spring equinox.
>>>
>>> On 11/30/2010 02:14 PM, Ámon Tamás wrote:
>>> > Hello,
>>> >
>>> > I have the nameday module ( http://drupal.org/project/nameday) and I
>>> > get a feature request for the Greek namedays. How I see it is based on
>>> > the Easter, what is not an easy thing to count.
>>> >
>>> > Well, I want to find some algorithm for Easter, and similar days, what
>>> > is can be stored somehow. Maybe it should be a hook or some other
>>> > think what can be stored in database.
>>> >
>>> >
>>> > Thanks
>>> >
>>> > --
>>> > Ámon Tamás
>>> > Sitefejlesztő és programozó
>>> >
>>> -------------- next part --------------
>>> An HTML attachment was scrubbed...
>>> URL:
>>> http://lists.drupal.org/pipermail/development/attachments/20101130/38791578/attachment-0001.html
>>>
>>> ------------------------------
>>>
>>> Message: 7
>>> Date: Tue, 30 Nov 2010 13:26:23 -0600
>>> From: "[email protected]" <[email protected]>
>>> Subject: Re: [development] Easter problem
>>> To: [email protected]
>>> Message-ID: <[email protected]>
>>> Content-Type: text/plain; charset=ISO-8859-2; format=flowed
>>>
>>> The Calendar PHP module is not enabled by default in a stock PHP, so I
>>> don't know that you can rely on it (unfortunately). It does have some
>>> cool stuff in it, though.
>>>
>>> --Larry Garfield
>>>
>>> On 11/30/10 1:22 PM, Carl Wiedemann wrote:
>>> > Does this help? http://php.net/manual/en/function.easter-days.php
>>> >
>>> > On Tue, Nov 30, 2010 at 12:14 PM, Ámon Tamás <[email protected]
>>> > <mailto:[email protected]>> wrote:
>>> >
>>> > Hello,
>>> >
>>> > I have the nameday module (http://drupal.org/project/nameday) and I
>>> > get a feature request for the Greek namedays. How I see it is based
>>> > on the Easter, what is not an easy thing to count.
>>> >
>>> > Well, I want to find some algorithm for Easter, and similar days,
>>> > what is can be stored somehow. Maybe it should be a hook or some
>>> > other think what can be stored in database.
>>> >
>>> >
>>> > Thanks
>>> >
>>> > --
>>> > Ámon Tamás
>>> > Sitefejlesztő és programozó
>>> >
>>> >
>>>
>>>
>>> ------------------------------
>>>
>>> Message: 8
>>> Date: Tue, 30 Nov 2010 11:21:08 -0800
>>> From: Jennifer Hodgdon <[email protected]>
>>> Subject: Re: [development] Easter problem
>>> To: [email protected]
>>> Message-ID: <[email protected]>
>>> Content-Type: text/plain; charset=UTF-8; format=flowed
>>>
>>> http://php.net/manual/en/function.easter-date.php
>>>
>>> On 11/30/2010 11:14 AM, Ámon Tamás wrote:
>>> > I have the nameday module (http://drupal.org/project/nameday) and I
>>> > get a feature request for the Greek namedays. How I see it is based
>>> > on the Easter, what is not an easy thing to count.
>>> >
>>> > Well, I want to find some algorithm for Easter, and similar days,
>>> > what is can be stored somehow. Maybe it should be a hook or some
>>> > other think what can be stored in database.
>>>
>>> --
>>> Jennifer Hodgdon * Poplar ProductivityWare
>>> www.poplarware.com
>>> Drupal web sites and custom Drupal modules
>>>
>>>
>>>
>>> ------------------------------
>>>
>>> --
>>> [ Drupal development list | http://lists.drupal.org/ ]
>>>
>>> End of development Digest, Vol 95, Issue 58
>>> *******************************************
>>
>
>
