I am having trouble with <include> in JPluck 2.pre7: a configuration that works 
with JPluck 0.9.1 no longer restricts the pluck as expected.

The site I am trying to pluck is the press releases section of EurekAlert 
(http://www.eurekalert.org/pubnews.php).  I want only the first page of press 
releases, plus the first-level links and inline images that are hosted on that 
website.

In my JPluck 0.9 jxl file I have the following section.  The URLs I want to 
pluck contain "pub_releases", and the images I want have "release_graphics" in 
their URL:
=====
<document stayOnHost="true" stayBelowPath="false" linkDepth="1" includeImages="true" 
includeImageAltText="true" includeFullSizeImages="false">
        <name>EurekAlert</name>
        <startPage>http://www.eurekalert.org/pubnews.php</startPage>
        <category>News</category>
        <language>EN</language>
        <include>
                <pattern>.*/pub_releases/.*</pattern>
                <pattern>.*/release_graphics/.*</pattern>
        </include>
</document>
=====

My JPluck 2.0 jxl file to pluck the site looks like this:
=====
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE jxl PUBLIC "-//jpluck//DTD JXL 2.0//EN" 
"http://jpluck.sourceforge.net/jxl/DTD/jxl-2.0.dtd">
<jxl>
        <site>
                <name>EurekAlert</name>
                <uri maxDepth="1" restrict="host">http://www.eurekalert.org/pubnews.php</uri>
                <include>
                        <pattern>.*/pub_releases/.*</pattern>
                        <pattern>.*/release_graphics/.*</pattern>
                </include>
                <images>
                        <embedded bpp="16" maxHeight="150" maxWidth="150"/>
                </images>
                <category>News</category>
        </site>
</jxl>
=====

While the JPluck 0.9 pluck works, JPluck 2.0 fetches a lot of unwanted links and 
images from the site.
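
To rule out the patterns themselves, here is a minimal sketch that checks them 
outside JPluck.  It assumes each <pattern> is a Java regular expression that must 
match the whole URL; I have not confirmed that this is how JPluck 2.0 applies 
them.  The class name and the sample URLs are made up for illustration.

```java
import java.util.regex.Pattern;

public class IncludePatternCheck {
    // The two patterns from the <include> sections above.
    static final Pattern[] INCLUDE = {
        Pattern.compile(".*/pub_releases/.*"),
        Pattern.compile(".*/release_graphics/.*")
    };

    // Assumption: a URL is included if at least one pattern matches the
    // entire URL. JPluck's real matching rule may differ.
    static boolean isIncluded(String url) {
        for (Pattern p : INCLUDE) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical URLs, not taken from the real site.
        String[] urls = {
            "http://www.eurekalert.org/pub_releases/2004-01/example.php",
            "http://www.eurekalert.org/release_graphics/example.jpg",
            "http://www.eurekalert.org/pubnews.php",
            "http://www.eurekalert.org/bysubject/index.php"
        };
        for (String url : urls) {
            System.out.println(isIncluded(url) + "  " + url);
        }
    }
}
```

Under that assumption the patterns accept only the first two URLs, so the 
expressions look fine on their own and the difference seems to lie in how 
JPluck 2.0 applies <include>.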

Am I doing something wrong with <include> in the JPluck 2.0 jxl file?

Regards,
Kam-Yung

_______________________________________________
plucker-list mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-list