Greetings, I think I have something to offer the world. Here's my info in
the order you asked for it at
http://www.cpan.org/modules/04pause.html#registering. Please consider
establishing a developer account for me. 

=== Info ===

Name        : Scott Thomason

Email       : [EMAIL PROTECTED]

Website     : industrial-linux.org

UserID      : SCOTTHOM

Description : XML::RSSLite   bdpf   
              Extract broken XML from RSS/RDF/SN/WL format

Background  : After recently reading about XML::RSS, I decided to give it a
try. Several hours later, I could still extract only about 40% of the open
content cataloged at xmltree.com. I realized that the majority of open
content syndicators are not very good at writing valid RSS XML.

My goal is to write a module that extracts all the open content that is
available and is much less concerned about XML compliance. This module
currently parses all but a handful of the XML URLs cataloged at
xmltree.com. Rather than rely on XML::Parser, it uses heuristics and good
old-fashioned Perl regular expressions. It stores the data in a simple
hash structure, and "aliases" certain tags so that when done, you can
count on having the minimal data necessary for re-constructing a valid
RSS file. This means you get the basic title, description, and link for a
channel and its items. Anything else present in the hash is a bonus :)
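
A minimal sketch of the regex-and-hash approach described above, not the
module's actual code. The names first_tag and parse_feed, the hash keys, and
the sample feed are all hypothetical and chosen for illustration only:

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of the first <tag>...</tag> pair
# found in a chunk, or undef when the pair is missing or unclosed.
sub first_tag {
    my ($chunk, $tag) = @_;
    return $chunk =~ m{<\Q$tag\E\b[^>]*>(.*?)</\Q$tag\E\s*>}is ? $1 : undef;
}

# Hypothetical parser: returns a hash reference holding the channel's
# title, description, and link (when present) plus an "items" array.
sub parse_feed {
    my ($xml) = @_;
    my %feed = (items => []);

    # Collect every <item> block with a regex; no XML parser is involved,
    # so a missing closing </channel> or </rss> does not matter.
    while ($xml =~ m{<item\b[^>]*>(.*?)</item\s*>}gis) {
        my $chunk = $1;
        my %item;
        for my $tag (qw(title description link url)) {
            my $value = first_tag($chunk, $tag);
            $item{$tag} = $value if defined $value;
        }
        # "Alias" <url> to link so callers can always look under "link".
        if ((!defined $item{link} || $item{link} eq '') && defined $item{url}) {
            $item{link} = $item{url};
        }
        push @{ $feed{items} }, \%item;
    }

    # Strip the items so channel-level tags are not confused with item tags.
    (my $channel = $xml) =~ s{<item\b[^>]*>.*?</item\s*>}{}gis;
    for my $tag (qw(title description link)) {
        my $value = first_tag($channel, $tag);
        $feed{$tag} = $value if defined $value;
    }
    return \%feed;
}

# Usage: the sample is deliberately not well-formed (nothing closes
# <rss> or <channel>), yet the title and links are still recovered.
my $feed = parse_feed(<<'XML');
<rss><channel>
<title>Example News</title>
<link>http://example.com/</link>
<item><title>First</title><link></link><url>http://example.com/1</url></item>
XML

print "$feed->{title}\n";
print "$_->{title}: $_->{link}\n" for @{ $feed->{items} };
```

Because nothing here validates the document, malformed input degrades to
missing keys rather than a parse failure, which is the trade-off the
paragraph above describes.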

This module extracts more usable links by parsing "scriptingNews" and
"weblog" formats in addition to RDF & RSS. It also "sanitizes" the extracted
data for best results. The munging includes:

  
  - Removing html tags to leave plain text
  - Eliminating objectionable characters
  - Using <url> tags when <link> is empty
  - Using misplaced urls in <title> when <link> is empty 
  - Ripping links from <a href=...> if required   
  - Limiting content to http/ftp links
  - Joining relative urls to the site base
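
Several of the steps above can be sketched as follows. This is an
illustration under stated assumptions, not the module's real code:
strip_html, join_url, and pick_link are hypothetical names, the item is
assumed to be a plain hash with title, description, link, and url keys, the
relative-URL join is deliberately naive (it ignores "../" segments), and
https is accepted alongside http and ftp as a further assumption. The
"objectionable characters" step is left out because the text does not say
which characters are meant.

```perl
use strict;
use warnings;

# Remove HTML tags and tidy whitespace to leave plain text.
sub strip_html {
    my ($text) = @_;
    return '' unless defined $text;
    $text =~ s{<[^>]*>}{}g;     # drop anything that looks like a tag
    $text =~ s{\s+}{ }g;        # collapse runs of whitespace
    $text =~ s{^ | $}{}g;       # trim one leading/trailing space
    return $text;
}

# Naive join of a possibly relative link to the site base.
sub join_url {
    my ($link, $base) = @_;
    return $link unless defined $base;
    return $link if $link =~ m{^[a-z][a-z0-9+.-]*:}i;   # already has a scheme
    my ($root) = $base =~ m{^(\w+://[^/]+)};            # scheme plus host
    return $link unless defined $root;
    return $root . $link if $link =~ m{^/};             # root-relative path
    (my $dir = $base) =~ s{[^/]*$}{};                   # drop last path segment
    $dir = "$root/" if $dir =~ m{^\w+://$};             # base had no path at all
    return $dir . $link;
}

# Try <link>, then <url>, then a URL misplaced in <title>, then the first
# <a href=...> in the description; keep the first http/https/ftp result.
sub pick_link {
    my ($item, $base) = @_;
    my @candidates = ($item->{link}, $item->{url});
    push @candidates, $1
        if defined $item->{title}
        && $item->{title} =~ m{((?:https?|ftp)://[^\s<>"']+)}i;
    push @candidates, $1
        if defined $item->{description}
        && $item->{description} =~ m{<a\b[^>]*\bhref\s*=\s*["']?([^"'\s>]+)}i;

    for my $candidate (@candidates) {
        next unless defined $candidate && $candidate =~ /\S/;
        (my $link = $candidate) =~ s/^\s+|\s+$//g;
        $link = join_url($link, $base);
        return $link if $link =~ m{^(?:https?|ftp)://}i;
    }
    return;    # nothing usable found
}

# Usage: an empty <link> falls back to a relative <url>.
my %item = (title => '<b>Big</b> story', link => '', url => '/big.html');
print strip_html($item{title}), "\n";
print pick_link(\%item, 'http://example.com/news/'), "\n";
```

Trying the candidates in a fixed order keeps the fallback rules in one
place; a production version would more likely delegate the URL joining to
an existing URL-handling module than hand-roll it.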

In the near future I plan to strengthen the weblog parsing, and add a
routine to generate valid RSS output from the result hash via XML::RSS,
plus contact  
