Thanks guys - looks like QueryPath is the way forward :)

--Jim
--
My IM and Skype details are at http://state68.com/contact
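
For anyone reading this thread later, here is a minimal sketch of the two values asked about in the original question: the title of the linked page and the provider of the resource. It is not QueryPath and not Drupal code. It uses only Python's standard library to illustrate the same parse-after-fetch idea, and it assumes the page has already been downloaded (in Drupal that would be the drupal_http_request() step discussed below). The names TitleParser and describe_resource, the example URL path, and the choice to treat the URL's host name as the "provider" are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <TITLE> also matches.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace and newlines into single spaces.
        return " ".join("".join(self._parts).split())


def describe_resource(url, html):
    """Return the page title and a guessed provider for a resource link.

    Assumption: the provider is approximated by the host name of the URL,
    with a leading "www." removed.
    """
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return {"title": parser.title, "provider": host}


if __name__ == "__main__":
    # Hypothetical page content and URL path, for illustration only.
    page = "<html><head><title>Theming Drupal 6</title></head><body><p>Unclosed"
    print(describe_resource("http://www.lullabot.com/articles/example", page))
```

The parser is tolerant of unclosed tags, which matters for the badly formed pages mentioned further down, but a real site would probably still want a tidy-up pass and a saner definition of "provider" than the host name.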

On 1 December 2010 11:42, Victor Kane <[email protected]> wrote:

> Well, Python has "Beautiful Soup".
>
> http://www.crummy.com/software/BeautifulSoup/
>
> "You didn't write that awful page. You're just trying to get some data out
> of it. Right now, you don't really care what HTML is supposed to look like.
>
> Neither does this parser."
>
> In PHP I have used http://simplehtmldom.sourceforge.net/ as a way of
> parsing badly formed HTML.
>
> I wrote a script to import nodes using the latter and then saved them with
> "node_save()".
>
> An alternative could be to parse to CSV, then import using the node_export
> or node_import modules.
>
> Hope that helps,
>
> Victor Kane
> http://awebfactory.com.ar
> http://projectflowandtracker.com
>
> On Wed, Dec 1, 2010 at 7:46 AM, Balazs Dianiska <[email protected]> wrote:
>
>> Sadly, some of the older legacy sites are just not available in RSS; I
>> had such a scraping request recently. I have to say that with
>> drupal_http_request you don't even have to look at curl. You can do
>> all sorts of things, even faking logins.
>>
>> To parse the HTML, use QueryPath. A trick that we use is to first run
>> some sort of HTML tidy-up library on the downloaded page, otherwise
>> QueryPath runs away crying. The beautify module can help you a great
>> deal with that.
>>
>> Balazs
>>
>> On Wed, Dec 1, 2010 at 5:27 AM, Cameron Eagans <[email protected]> wrote:
>> > Most of the time, you can get to the posts via RSS. The Aggregator
>> > module does a pretty good job of pulling stuff in, and the author of
>> > the post that's displayed is whatever you tell it to display (see
>> > Drupal Planet for an example).
>> >
>> > Thanks,
>> > Cameron
>> >
>> > On Tue, Nov 30, 2010 at 12:48, Kevin O <[email protected]> wrote:
>> >>
>> >> I second the recommendation of using QueryPath. I use it almost
>> >> exclusively along with drupal_http_request, and use curl in only a
>> >> few places (if you use curl, I recommend http://drupal.org/project/curl
>> >> for a dependency check). I'd really recommend, though, creating a
>> >> custom module that uses the above and has your filtering logic in it.
>> >> I've done this for about a dozen modules now.
>> >>
>> >> That said, there are some more modules available out there nowadays,
>> >> such as http://drupal.org/project/feeds_xpathparser used with Feeds,
>> >> http://drupal.org/project/feeds. There are about a dozen more modules
>> >> that will accomplish the goal, though I haven't used them; I did go
>> >> through and try most of the methods out for some recent projects.
>> >>
>> >> Cheers,
>> >> Kevin O'Brien
>> >> Drupal Developer
>> >> http://www.coderintherye.com
>> >> 415-754-0112
>> >>
>> >> On Tue, Nov 30, 2010 at 11:26 AM, <[email protected]> wrote:
>> >>>
>> >>> Today's Topics:
>> >>>
>> >>>    1. Drupal module for scraping information from an HTML/XML
>> >>>       document (James Benstead)
>> >>>    2. Re: Drupal module for scraping information from an HTML/XML
>> >>>       document (John Fiala)
>> >>>    3. Easter problem (Ámon Tamás)
>> >>>    4. Re: Easter problem (Carl Wiedemann)
>> >>>    5. Re: Easter problem ([email protected])
>> >>>    6. Re: Easter problem ([email protected])
>> >>>    7. Re: Easter problem ([email protected])
>> >>>    8. Re: Easter problem (Jennifer Hodgdon)
>> >>>
>> >>> ----------------------------------------------------------------------
>> >>>
>> >>> Message: 1
>> >>> Date: Tue, 30 Nov 2010 18:56:09 +0000
>> >>> From: James Benstead <[email protected]>
>> >>> Subject: [development] Drupal module for scraping information from an
>> >>>         HTML/XML document
>> >>> To: development <[email protected]>
>> >>>
>> >>> I've finally got round to doing some serious work on Drupalversity,
>> >>> an open, web-based Drupal education project I've had in mind for a
>> >>> year or so.
>> >>>
>> >>> People who use Drupalversity to learn have the option of adding
>> >>> Resources to the site - i.e., links to posts at Lullabot, Chapter3
>> >>> etc. that explain how to do specific things with Drupal. A Resource
>> >>> is a custom content type that includes a link to the resource and a
>> >>> text field containing a description of that resource.
>> >>>
>> >>> What I'd like to do once a Resource has been added to the site is to
>> >>> scrape certain information from it: at this point I'm thinking the
>> >>> Title of the page the link points to and the provider of the
>> >>> resource - e.g., which Drupal shop originally created the resource.
>> >>> What's the best way to go about doing this? I'm pretty sure there's
>> >>> not a Drupal module that solves the problem out of the box.
>> >>>
>> >>> So far I've considered:
>> >>>
>> >>>    - http://drupal.org/project/querypath
>> >>>    - Drupal's built-in drupal_http_request() -
>> >>>      http://api.drupal.org/api/drupal/includes--common.inc/function/drupal_http_request/6
>> >>>    - curl
>> >>>
>> >>> Thanks,
>> >>>
>> >>> --Jim
>> >>> --
>> >>> My IM and Skype details are at http://state68.com/contact
>> >>>
>> >>> ------------------------------
>> >>>
>> >>> Message: 2
>> >>> Date: Tue, 30 Nov 2010 12:06:33 -0700
>> >>> From: John Fiala <[email protected]>
>> >>> Subject: Re: [development] Drupal module for scraping information from
>> >>>         an HTML/XML document
>> >>> To: [email protected]
>> >>>
>> >>> These days, if I'm going to be trying to extract data from HTML/XML,
>> >>> I'd use QueryPath. Give it a try!
>> >>>
>> >>> --
>> >>> John Fiala
>> >>> www.jcfiala.net
>> >>>
>> >>> ------------------------------
>> >>>
>> >>> Message: 3
>> >>> Date: Tue, 30 Nov 2010 20:14:04 +0100
>> >>> From: Ámon Tamás <[email protected]>
>> >>> Subject: [development] Easter problem
>> >>> To: [email protected]
>> >>>
>> >>> Hello,
>> >>>
>> >>> I have the nameday module (http://drupal.org/project/nameday) and I
>> >>> got a feature request for Greek namedays. As I see it, they are
>> >>> based on Easter, which is not an easy thing to calculate.
>> >>>
>> >>> So I want to find an algorithm for Easter, and for similar days,
>> >>> whose results can be stored somehow. Maybe it should be a hook, or
>> >>> some other thing that can be stored in the database.
>> >>>
>> >>> Thanks
>> >>>
>> >>> --
>> >>> Ámon Tamás
>> >>> Site developer and programmer
>> >>>
>> >>> ------------------------------
>> >>>
>> >>> Message: 4
>> >>> Date: Tue, 30 Nov 2010 12:22:42 -0700
>> >>> From: Carl Wiedemann <[email protected]>
>> >>> Subject: Re: [development] Easter problem
>> >>> To: [email protected]
>> >>>
>> >>> Does this help? http://php.net/manual/en/function.easter-days.php
>> >>>
>> >>> ------------------------------
>> >>>
>> >>> Message: 5
>> >>> Date: Tue, 30 Nov 2010 13:24:07 -0600
>> >>> From: "[email protected]" <[email protected]>
>> >>> Subject: Re: [development] Easter problem
>> >>> To: [email protected]
>> >>>
>> >>> There's no need for a hook here at all. You can either code in the
>> >>> algorithm for defining when Easter is (which sounds like it is in
>> >>> fact rather complicated) or just pre-store known, pre-calculated
>> >>> dates for it for the next decade or so. (10 records, one per year;
>> >>> totally easy.)
>> >>>
>> >>> Both options are described here, including the different mechanisms
>> >>> for defining when Easter is in different calendars:
>> >>>
>> >>> http://en.wikipedia.org/wiki/Easter#Date_of_Easter
>> >>>
>> >>> --Larry Garfield
>> >>>
>> >>> ------------------------------
>> >>>
>> >>> Message: 6
>> >>> Date: Tue, 30 Nov 2010 14:23:56 -0500
>> >>> From: [email protected]
>> >>> Subject: Re: [development] Easter problem
>> >>> To: [email protected]
>> >>>
>> >>> You can google it, but I believe this is one of those things that
>> >>> cannot be reduced to an equation or algorithm. It's something like
>> >>> the first Sunday after the first full moon after the spring equinox.
>> >>>
>> >>> ------------------------------
>> >>>
>> >>> Message: 7
>> >>> Date: Tue, 30 Nov 2010 13:26:23 -0600
>> >>> From: "[email protected]" <[email protected]>
>> >>> Subject: Re: [development] Easter problem
>> >>> To: [email protected]
>> >>>
>> >>> The Calendar PHP module is not enabled by default in a stock PHP, so
>> >>> I don't know that you can rely on it (unfortunately). It does have
>> >>> some cool stuff in it, though.
>> >>>
>> >>> --Larry Garfield
>> >>>
>> >>> On 11/30/10 1:22 PM, Carl Wiedemann wrote:
>> >>> > Does this help? http://php.net/manual/en/function.easter-days.php
>> >>>
>> >>> ------------------------------
>> >>>
>> >>> Message: 8
>> >>> Date: Tue, 30 Nov 2010 11:21:08 -0800
>> >>> From: Jennifer Hodgdon <[email protected]>
>> >>> Subject: Re: [development] Easter problem
>> >>> To: [email protected]
>> >>>
>> >>> http://php.net/manual/en/function.easter-date.php
>> >>>
>> >>> --
>> >>> Jennifer Hodgdon * Poplar ProductivityWare
>> >>> www.poplarware.com
>> >>> Drupal web sites and custom Drupal modules
>> >>>
>> >>> ------------------------------
>> >>>
>> >>> --
>> >>> [ Drupal development list | http://lists.drupal.org/ ]
>> >>>
>> >>> End of development Digest, Vol 95, Issue 58
>> >>> *******************************************
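
On the Easter question in the quoted digest: the date can in fact be computed with a few lines of integer arithmetic. The sketch below is not taken from any message in the thread. It uses the well-known Meeus Julian algorithm for Orthodox Easter (the one relevant to Greek namedays) and then shifts the Julian calendar result onto the Gregorian calendar. The function name orthodox_easter is made up for illustration, and the fixed 13-day shift is an assumption that only holds for the years 1900 to 2099, so the function refuses anything outside that range.

```python
from datetime import date, timedelta


def orthodox_easter(year):
    """Return the Gregorian calendar date of Orthodox Easter for `year`.

    Uses Meeus's Julian algorithm, then adds the 13-day difference between
    the Julian and Gregorian calendars. That difference is only 13 days
    from 1900 through 2099, so other years are rejected.
    """
    if not 1900 <= year <= 2099:
        raise ValueError("13-day calendar offset is only valid for 1900-2099")

    a = year % 4
    b = year % 7
    c = year % 19
    d = (19 * c + 15) % 30
    e = (2 * a + 4 * b - d + 34) % 7

    # Month and day of Easter Sunday in the Julian calendar.
    month = (d + e + 114) // 31
    day = (d + e + 114) % 31 + 1

    # Shift the Julian date onto the Gregorian calendar.
    return date(year, month, day) + timedelta(days=13)


if __name__ == "__main__":
    for y in (2010, 2011, 2016):
        print(y, orthodox_easter(y).isoformat())
```

For 2010, 2011 and 2016 this gives 4 April, 24 April and 1 May respectively. Because the result is a plain date, it also fits the suggestion in Message 5 of pre-storing one record per year: the function could fill such a table once rather than run on every request.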
