On Mon, 2004-04-12 at 18:29, Paul Winkler wrote:
> FYI, I'm still planning to implement my own proposal which has
> been discussed quite a lot in the L-A-U archives.
> I do somewhat similar sites for a living. It just needs me to block out a
> chunk of time (1 or 2 weekends) to bang it out.
Hi Paul,

I'd like to assist, so I've written a little script to automatically extract all the links from linux-sound.org. It depends on Ruby, wget, lynx and sed.

The output is tab-separated values (TSV) with three fields per row: text, urls and category. Some of the <li>s contain more than one URL; for these, the URLs are separated by blanks (' '). The category is either the title (<h3>) of the subpage or the text of the list item that contains the current list (<ul>).

The script expects at least one URL from linux-sound.org as an argument (one of the subpages), so you'll also need a working internet connection. A '-H' prints an additional header row.

I also attached a bash script which fetches all the subpages of linux-sound.org. If you have any problems with this script I can send you all the data off list.

HTH,
Jan

P.S. Follow-up to LAU?
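To make the output format concrete, here is a minimal sketch of reading one such row back in Ruby. The row contents ("Ardour", the two URLs, "Sequencers") are made-up sample values, not data from the actual site:

```ruby
#!/usr/bin/env ruby
# Minimal sketch: parse one row of the script's TSV output.
# The sample values below are invented for illustration only.

row = "Ardour\thttp://ardour.org/ http://www.ardour.org/\tSequencers"

text, urls, cat = row.chomp.split("\t")  # three tab-separated fields
url_list = urls.split(" ")               # multiple URLs are blank-separated

puts "#{cat}: #{text} -> #{url_list.inspect}"
```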
#!/usr/bin/env ruby
# This little piece of software is free in every sense of the word.
# Mon, 19 Apr 2004, Jan Weil <[EMAIL PROTECTED]>

if ARGV.include?("-h") || ARGV.include?("--help") || ARGV.size == 0
    puts "usage: #{File.basename($0)} [-H] URL..."
    puts "-H --header\tadd csv header"
    exit
end

if ARGV.include?("-H") || ARGV.include?("--header")
    $print_header = true
    ARGV.delete("-H")
    ARGV.delete("--header")
end

def extract_urls(str)
    urls = []
    url_regex = /\[(\d+)\](\S.+)/
    loop do
        if str =~ url_regex
            urls.push($reference[$1.to_i])
            str.sub!(url_regex) {|s| $2}
        else
            break
        end
    end
    if not urls.empty?
        return urls.join(" ")
    else
        return false
    end
end

def push_li(line, level, regex)
    next_line = ""
    loop do
        next_line = $lines.pop
        if next_line =~ regex
            line += " #{$1}"
        else
            break
        end
    end
    $lines.push(next_line)
    urls = extract_urls(line)
    $data.push({"text" => line, "urls" => urls, "cat" => $cat[level] || "None"}) if urls
    $cat[level + 1] = line
end

ARGV.each do |url|
    $reference = []
    $cat = []
    $data = []
    # XXX this works, at least for linux-sound.org
    url =~ /(\w+\.\w+)$/
    loc = $1 or raise("Help me at XXX!")
    `wget #{url}`
    if $? != 0
        exit 1
    end
    tmp = loc + ".dump"
    # unset locales (we need ^References$)
    ENV["LANG"] = "C"
    `lynx -dump #{loc} > #{tmp}`
    if $? != 0
        STDERR << "calling lynx failed! Is it installed?\n"
        exit 1
    end
    # extract link list (legend)
    out = `sed -n '/^References$/,$p' #{tmp} | sed -n '3,$p'`.split(/$/)
    if $? != 0
        STDERR << "calling sed failed! Is it installed?\n"
        exit 1
    end
    out.each do |line|
        ary = line.split
        $reference[ary[0].to_i] = ary[1]
    end
    # extract data
    $lines = `sed -n '1,/^References$/p' #{tmp}`.split(/$/)
    File.delete(tmp)
    # we need a stack
    $lines.reverse!
    # traverse all lines
    loop do
        line = $lines.pop
        break if not line
        # title
        if line =~ /^ (\S.*)$/
            $cat[1] = $1
            next
        end
        # li level 1
        if line =~ / \* (\S.*)$/
            line = $1
            push_li(line, 1, /^ (\S.*)$/)
            next
        end
        # li level 2
        if line =~ / \+ (\S.*)$/
            line = $1
            push_li(line, 2, /^ (\S.*)$/)
            next
        end
        # li level 3
        if line =~ /^ o (\S.*)$/
            line = $1
            push_li(line, 3, /^ (\S.*)$/)
            next
        end
        # li level 4
        if line =~ /^ # (\S.*)$/
            line = $1
            push_li(line, 4, /^ (\S.*)$/)
            next
        end
        # there is no higher level, right?
    end

    $data.sort! do |a, b|
        if a["cat"] == b["cat"]
            ret = a["text"] <=> b["text"]
        else
            ret = a["cat"] <=> b["cat"]
        end
        ret
    end

    print "Text\tUrls\tCategory\n" if $print_header
    $data.each do |hash|
        print "#{hash['text']}\t#{hash['urls']}\t#{hash['cat']}\n"
    end
end
#!/bin/bash

for i in cd cshelp demo ddj docs drum dsp different fx compress convert games \
         guitar jack java ladspa distro music mlists midi mix mod mpeg hdr \
         notation mut players radio repositories scopes swss tools drivers \
         snded sounds speech telephony difficult; do
    ./extract-linux-sound-org-data.ruby http://linux-sound.org/$i.html > $i.csv
    # -H prints an additional csv header
    # ./extract-linux-sound-org-data.ruby -H http://linux-sound.org/$i.html > $i.csv
done
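Once the bash loop has produced the per-page files, it might be handy to combine them into a single table. This is only a hypothetical helper, not part of the original attachments; the header line matches the extraction script's -H output, but the merge function itself is an assumption:

```ruby
#!/usr/bin/env ruby
# Hypothetical merge helper (not part of the original attachments):
# combine the rows of several per-page TSV files under one header.

HEADER = "Text\tUrls\tCategory"

def merge_tsv(files)
  rows = [HEADER]
  files.sort.each do |file|
    File.readlines(file).each do |row|
      row = row.chomp
      # drop empty lines and any repeated per-file headers
      rows << row unless row == HEADER || row.empty?
    end
  end
  rows
end

# e.g. merge_tsv(Dir.glob("*.csv")) after running the bash loop above
```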