Request for comments: RDF/XML descriptions for Plucker I/O with other datasources

Robert O'Connor Tue, 19 Feb 2002 12:06:42 -0800

Does anyone have an opinion on using an RDF/XML spec for describing content
that can be imported into or exported from Plucker? Right now the Plucker
Desktop showcase reads its descriptions from an HTML file, but XML is a
better long-term strategy. Probably one of the more difficult aspects of a
non-XML format, such as an ini file, is arrays of items, since the delimiter
then can't be used inside an entry. One of the most useful parts of a
showcase file is the exclusions: they take effort to set up, so they are
most useful if they come premade, ready to go.
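
To make the delimiter problem concrete, here is a minimal sketch in Python. The comma-joined ini value and the two patterns are made up for illustration; they are not taken from any existing Plucker file.

```python
import xml.etree.ElementTree as ET

# Hypothetical exclusion patterns; the second one contains a comma.
patterns = [r".*\.zip$", r".*page=\d{1,3}$"]

# ini-style: all items joined into one value with a comma delimiter.
ini_value = ",".join(patterns)

# Splitting on the delimiter cuts the second pattern at the comma in {1,3}.
recovered = ini_value.split(",")
print(len(patterns), "patterns written,", len(recovered), "read back")

# XML-style: one element per item, so no delimiter is needed at all.
root = ET.Element("exclusions")
for pattern in patterns:
    ET.SubElement(root, "exclusion_item").text = pattern
xml_text = ET.tostring(root, encoding="unicode")

# Reading the elements back returns the original list unchanged.
roundtrip = [item.text for item in ET.fromstring(xml_text).iter("exclusion_item")]
print(roundtrip == patterns)
```

An ini format could work around this with escaping rules, but then every reader and writer has to implement them; with XML the parser already does.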


I finished off an XML parser in C++ to read XML Plucker channel
descriptions. Expat was chosen as the base because it is the fastest,
validation isn't needed (or planned), and expat is already in the Plucker
Desktop code (the XML resource parser is based on it), so it adds no extra
library size to the final executable.
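
For anyone who wants to experiment with the format without building the C++ code, the same event-driven approach can be sketched with Python's binding to expat. This is only an illustration, not the Plucker Desktop parser: the function name read_channels and the flat dictionary it returns are assumptions made for the example.

```python
import xml.parsers.expat

SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<channel_list>
    <channel>
        <title>Advogato</title>
        <link>http://www.advogato.org</link>
        <limits_options>
            <maxdepth>1</maxdepth>
        </limits_options>
    </channel>
</channel_list>"""


def read_channels(xml_text):
    """Collect the text of leaf elements for each <channel>.

    Grouping wrappers such as <limits_options> hold no text of their
    own, so they are skipped and the result is a flat dictionary.
    """
    channels = []
    current = None  # dictionary for the channel being read, if any
    text = []       # character data seen since the last tag

    def start(name, attrs):
        nonlocal current
        if name == "channel":
            current = {}
        text.clear()

    def end(name):
        nonlocal current
        if name == "channel":
            channels.append(current)
            current = None
        elif current is not None:
            value = "".join(text).strip()
            if value:
                current[name] = value
        text.clear()

    def chars(data):
        text.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)  # True marks the final chunk of input
    return channels


channels = read_channels(SAMPLE)
print(channels)
```

Flattening only works if element names are unique across the groups, which is one argument for unique names if the grouping tags are meant to be purely cosmetic.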

What remains is deciding what the XML file should look like.

Options are:
[] Make it RDF-like.
[] Use namespaces of metadata descriptions already existing:
   --Dublin Core (dc) for the language the channel is in, author, area of
coverage (these are columns in the showcase dialog, so can sort out channels
in the user's language, location).
   --Syndication (sy) for the update frequency, period and database.
[] Non proprietary: elements aren't called plucker*.
[] Room for extension by later plucker, and/or other types of programs.
[] Can be logically read by a human reader.
[] Organized by <some_property>1</some_property> instead of
<some_property value="1" />
   --Except for the exclusion list, which is more readable in the style of
     <exclusion_item action="include" priority="1">.*zip$</exclusion_item>
[] Either use unique element names such as <exclusion_item>, or use generic
list and item elements
   such as <exclusion_list>
             <li>...
             <li>...
           </exclusion_list>
[] For new-user readability, either use namespaces, such as
<images:max_compression>, or nest related elements inside an <images> tag,
such as
<images><max_compression>1</max_compression><bpp>1</bpp></images>
If element names are unique, the <images> wrapper can simply be ignored
during parsing. It doesn't do anything; it just makes it more obvious what
that group of elements relates to, instead of a giant amorphous mass of data
elements.
[] Use underscores or mixed case for element names. I like underscores
better, but mixed case is usually the way the rest of the world works:
others use updatePeriod instead of update_period. Better to use one style or
the other consistently in the format.
[] Elements named in the positive, i.e.
<include_url_info>1</include_url_info>, so there is no double-negative
confusion.
[] Ask whether other interested parties (Mozilla, mazingo, open directory
projects, etc.) are willing to participate in hammering out a standard, so
that there aren't 99 versions of expressing the same data. The Dublin Core
idea works well as a shared vocabulary for describing a document's library
information, but no spec exists for handheld sites, so we might as well make
one.
[] All descriptions must live inside a top-level node, so need to decide
what to call that too.
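
One practical consequence of reusing the dc and sy vocabularies: the prefixes have to be declared somewhere, and the top-level node is the natural place. The sketch below uses Python's ElementTree with the published Dublin Core and RSS syndication namespace URIs; the trimmed-down sample document is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

# Namespace URIs of the Dublin Core element set and the RSS 1.0
# syndication module.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "sy": "http://purl.org/rss/1.0/modules/syndication/",
}

# Hypothetical cut-down channel list with the prefixes declared on the
# top-level node.
SAMPLE = """<channel_list
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:sy="http://purl.org/rss/1.0/modules/syndication/">
    <channel>
        <title>Advogato</title>
        <dc:language>en</dc:language>
        <sy:updatePeriod>daily</sy:updatePeriod>
        <sy:updateFrequency>2</sy:updateFrequency>
    </channel>
</channel_list>"""

root = ET.fromstring(SAMPLE)
channel = root.find("channel")

# Un-prefixed elements are looked up by plain name, prefixed ones
# through the prefix-to-URI mapping.
title = channel.findtext("title")
language = channel.findtext("dc:language", namespaces=NS)
period = channel.findtext("sy:updatePeriod", namespaces=NS)
frequency = int(channel.findtext("sy:updateFrequency", namespaces=NS))
print(title, language, period, frequency)

# A namespace-aware parser rejects a prefix that was never declared.
try:
    ET.fromstring("<channel><dc:language>en</dc:language></channel>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
print("undeclared prefix accepted:", undeclared_ok)
```

A parser that is not namespace-aware will treat dc:language as an ordinary element name, so whether the declarations are strictly required depends on how the file is read. Declaring them anyway keeps the file usable by other tools.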


Below is one possible organization, with elements nested by topic and RDF
removed for now for clarity. I think namespaces would work better though.

Best wishes,
Robert



<?xml version="1.0" encoding="utf-8"?>

<channel_list>
    <channel>
        <title>Advogato</title>
        <link>http://www.advogato.org</link>

        <spidering_options>
            <verbosity>1</verbosity>
            <close_on_exit>1</close_on_exit>
            <close_on_error>0</close_on_error>
        </spidering_options>

        <limits_options>
            <maxdepth>1</maxdepth>
            <stayonhost>1</stayonhost>
            <stayondomain>1</stayondomain>
            <exclusions>

                <exclusion_file>http://www.advogato.com/exclusionlist.rss</exclusion_file>
                <exclusion_item action="exclude" priority="0">http://www\.osdn\.com.*</exclusion_item>
            </exclusions>
        </limits_options>

        <security_options>
           <copyprevention_bit>0</copyprevention_bit>
           <owner_id>Bill</owner_id>
        </security_options>

        <images_options>
            <bpp>1</bpp>
            <maxheight>250</maxheight>
            <maxwidth>150</maxwidth>
            <alt_maxheight>1000000</alt_maxheight>
            <alt_maxwidth>1000000</alt_maxwidth>
            <image_compression_limit>50</image_compression_limit>
        </images_options>

        <output_options>
            <backup_bit>1</backup_bit>
            <launchable_bit>0</launchable_bit>
            <no_url_info>1</no_url_info>
            <categories>
                <category>News</category>
                <category>Linux</category>
            </categories>
            <compression>zlib</compression>
            <launcher:show_icon>1</launcher:show_icon>
            <launcher:large_icon_file>big.bpp</launcher:large_icon_file>
            <launcher:small_icon_file>small.bpp</launcher:small_icon_file>
        </output_options>

        <destinations>
            <doc_file>C:\windows\desktop\ethics</doc_file>
            <sync_users>
                <user>Rob O'Connor</user>
                <user>Steve Evans</user>
            </sync_users>
            <copy_to_directories>
                <directory>C:\output</directory>
                <directory>C:\publish</directory>
            </copy_to_directories>
        </destinations>

        <autoupdate_options>
            <sy:updatePeriod>daily</sy:updatePeriod>
            <sy:updateFrequency>2</sy:updateFrequency>
            <sy:updateBase>2000-01-01T12:00+00:00</sy:updateBase>
        </autoupdate_options>

        <!-- Dublin Core -->

        <!-- Needed? Already have a title -->
        <dc:title>Advogato</dc:title>
        <!-- TGN qualifier (Getty Thesaurus of Geographic Names) seems like
             the best bet here, since some channels are city-specific, others
             by country, continent, or world -->
        <dc:coverage>Boston, MA, USA</dc:coverage>
        <dc:description>The free software developer's advocate.</dc:description>
        <dc:subject>Free software, GPL, News</dc:subject>
        <!-- Needed? 99% are going to be text. Only perhaps a few image only -->
        <dc:type>text</dc:type>

        <dc:creator>Advogato team ([EMAIL PROTECTED])</dc:creator>
        <dc:publisher>Independent</dc:publisher>
        <dc:rights>All rights reserved</dc:rights>
        <dc:contributor>John Spencer</dc:contributor>

        <!-- Qualifier here? Created, or updated? Relevant when there is an
             updateBase? -->
        <dc:date>2002-02-14</dc:date>
        <dc:language>en</dc:language>

        <!-- Future additions: -->
        <!-- Whether channel currently active or not -->
        <active>1</active>
        <!-- Strip out alt tags for uncrawled images -->
        <strip_alt_tags>1</strip_alt_tags>

        <!-- Future optionals:
             - An image for the channel, like in an RSS feed.
             - Number of pages after which to abort the spider and build a pdb
               of what was done.
             - Size (in kb) of output file after which to abort the spider and
               build the pdb.
             - Estimate of the size (in kb) of a channel (or combine with the
               previous item?)
        -->
    </channel>

</channel_list>

Request for comments: RDF/XML descriptions for Plucker I/O with other datasources
