On May 7, 2009, Dan Lynch wrote:
> I think Gordon's original problem is that BeautifulSoup doesn't
> work properly for him on Debian (I think it was Debian). He's
> tried the Lastscrape Python script already if I read the
> original comment correctly. He would prefer something written
> in Perl which is why he's doing this. That's how I understood
> it but obviously I can't speak for Gordon, I'm sure he'll reply
> soon. Just wanted to explain the situation.

After spending 4 days changing a water pump (domestic water supply 
at the farm), I'm back working on things.

Yes, I could install a deprecated version of BeautifulSoup and 
scrape my information off Last.fm.  If I had much experience with 
Python, I could fix the scraper so that it would work with the 
up-to-date BeautifulSoup.

I can certainly understand why something like a scraper is 
sensitive to things which are HTML, or look like it.  I've 
written programs before to clean it (HTML) up, and it isn't nice.  
The best program I've seen for cleaning HTML is HTML Tidy from 
W3C.

Anyway, while the CPAN module WebService::LastFM has a function to 
get a tracklist, it is a random sample of 10 tracks you have 
listened to.  After spending a couple of days looking at the 
code, I see no easy way to get it to produce the entire tracklist 
you want.

On the Perl side, XML::Simple couldn't parse a sample download of 
1 page of my own data.  XML::Twig also can't parse it, but at 
least it would give me messages as to where the problem was.  So, 
I tried the incremental edit method.  There seem to be several 
different styles of sub trees present in any given page.  Some of 
the trees are parsed without problems, some with easy to fix 
problems, and some are more difficult.  Things like escaping 
double quotation marks, and escaping forward slashes were causing 
problems.  I eventually ran into a problem near the bottom of the 
page which would require a bit of work to fix, and thought about 
other means.

HTML Tidy is freely available for many platforms, and is fast (I 
think it is written in C).  If I run Tidy to indent and clean, 
and to assume UTF8 input and generate UTF8 output, I almost get a 
file which XML::Twig will process.  There are a couple of 
attributes of elements in the page which are empty (such as 
alt=""), which XML::Twig thinks are duplicates.  And XML::Twig 
doesn't understand &nbsp; for some reason.  Deleting the empty 
attributes from the text of the page, and changing &nbsp; into a 
space are enough to get XML::Twig to parse the file.
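
As a rough illustration of the cleanup described above, here is a 
minimal Python sketch.  It is not the Perl code from this message.  
It assumes the tidy executable is on PATH and that the flags shown 
correspond to the options described (XML output, indent, clean, 
UTF-8 in and out).  The names run_tidy and make_parseable are 
hypothetical, and the regular expression works on the raw text, so 
it would also strip anything in the page body that happens to look 
like an empty attribute.

```python
import re
import subprocess
import xml.etree.ElementTree as ET

# Assumption: `tidy` is installed and on PATH; these flags mirror the
# options described in the text (indent, clean, UTF-8 input and output).
TIDY_CMD = ["tidy", "-asxml", "-indent", "-clean", "-utf8", "-quiet"]


def run_tidy(html_bytes):
    """Pipe raw HTML bytes through Tidy and return its output as text."""
    # Tidy exits with status 1 when it only has warnings, so the exit
    # status is deliberately not treated as a failure here.
    result = subprocess.run(TIDY_CMD, input=html_bytes, capture_output=True)
    return result.stdout.decode("utf-8")


# Matches an attribute with an empty value, such as  alt=""
EMPTY_ATTR = re.compile(r'\s+[\w:-]+=""')


def make_parseable(text):
    """Drop empty attributes and turn &nbsp; into a plain space."""
    text = EMPTY_ATTR.sub("", text)
    return text.replace("&nbsp;", " ")


if __name__ == "__main__":
    sample = '<p><img src="a.png" alt=""/>Hello&nbsp;World</p>'
    root = ET.fromstring(make_parseable(sample))
    print(ET.tostring(root, encoding="unicode"))
```

Doing the two substitutions on the text before parsing keeps the 
parser itself unchanged, which matches the approach taken above.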

Twig is meant to process big files.  Having an almost 2000 line 
HTML file to describe 51 lines of data is excessive, and so Twig 
makes sense.  It is easy to have Twig delete comments, script, 
noscript and form elements (and all their children) when it is 
parsing the file.  The last thing was to get the ISO timestamp 
out of the title attribute of the abbr element in the table, and 
with Twig one can provide a subroutine which overwrites the text 
content of the abbr element with the value in the title 
attribute.
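
The pruning and the abbr rewrite described above can be sketched in 
Python with the standard library's ElementTree, as a stand-in for 
the Twig handlers (it is not the Twig API).  The function name 
prune_and_promote is hypothetical.  The sketch assumes tag names 
without a namespace; XHTML output that declares a default namespace 
would need namespace-qualified tag names instead.  ElementTree's 
default parser already discards comments, so they need no handling 
here.

```python
import xml.etree.ElementTree as ET

# Elements to drop together with all of their children.
UNWANTED = {"script", "noscript", "form"}


def prune_and_promote(root):
    """Remove unwanted elements, then copy each abbr title into its text."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in UNWANTED:
                # Note: any text directly following the removed element
                # (its "tail") is dropped along with it.
                parent.remove(child)
    for abbr in root.iter("abbr"):
        title = abbr.get("title")
        if title:
            for child in list(abbr):
                abbr.remove(child)
            abbr.text = title  # e.g. the ISO timestamp
    return root


if __name__ == "__main__":
    doc = ET.fromstring(
        '<div><script>x()</script>'
        '<abbr title="2009-05-07T12:00:00Z">7 May</abbr></div>'
    )
    print(ET.tostring(prune_and_promote(doc), encoding="unicode"))
```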

Twig understands XPath, and now all of the data you want in your 
file is easily found using XPath.  So, I am left with writing the 
XPath part, and putting the whole thing in a loop so that I can 
download the 5629 pages of data that I have at Last.fm.  (Well, 
you can use XPath to extract where it is, but I was looking at 
having all of the data of interest in the text part of elements, 
which is not strictly needed.)
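
A minimal Python sketch of the remaining two pieces, the XPath 
lookup and the loop over pages, might look like the following.  The 
table layout (one tr per track, data in td cells) is an assumption 
for illustration, since the real page structure is not shown here.  
fetch_page is a hypothetical callable that returns the already 
cleaned XML text for a given page number; no Last.fm URL is 
assumed.  ElementTree supports only a subset of XPath, which is 
enough for this.

```python
import xml.etree.ElementTree as ET


def extract_rows(root):
    """Return the text of the td cells of every table row, one list per row."""
    rows = []
    for tr in root.findall(".//tr"):
        cells = ["".join(td.itertext()).strip() for td in tr.findall("td")]
        if cells:  # skip rows that have no td cells, such as header rows
            rows.append(cells)
    return rows


def scrape_all(fetch_page, last_page):
    """Collect rows from pages 1..last_page using the supplied fetcher."""
    all_rows = []
    for page in range(1, last_page + 1):
        root = ET.fromstring(fetch_page(page))
        all_rows.extend(extract_rows(root))
    return all_rows


if __name__ == "__main__":
    # Stand-in data instead of real downloads.
    pages = {1: "<table><tr><td>Artist A</td><td>Track 1</td></tr></table>"}
    print(scrape_all(pages.__getitem__, 1))
```

Passing the fetcher in as a parameter keeps the download step (and 
the Tidy cleanup) separate from the extraction, so either can be 
swapped out.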

In terms of instructing people interested in libre.fm about how to 
get their data, sure you can keep your current method.  And maybe 
sometime the current version of BeautifulSoup actually works.  
And then something else happens.

What seems better, and I don't care if you like Python, Perl, Ruby 
or ksh, is to have HTML Tidy clean up the page, then process it 
with whatever XML methods you like.  The cleanup by Tidy removes 
about 20% of the file that I was working with.  And you are 
probably left with a file that people can work with without 
having to be told to install deprecated parsers.  That is about 
as bad as telling someone today that they need PC-DOS-3.09 to run 
some program.

Gord
_______________________________________________
Libre-fm mailing list
[email protected]
http://lists.autonomo.us/mailman/listinfo/libre-fm