I am using the default xmlparser-conf.xml; I just copied it into the
nutch/conf dir. To test it I used xmltest.xml, the XML file given in the
sample directory, which I have also uploaded at
http://www.jkg.in/xmltest.xml
I do not get any errors while indexing or parsing. The crawl log is
attached. I am able to get the xml file in the results when I search
for 'XPath', but when I click the explain link, it doesn't show the
dctitle field in the index, which it should.
I just noticed that hadoop.log has some errors about handling XML files,
and I cannot see parse-xml being loaded, even though I have it enabled in
my nutch-site.xml. I am new to nutch-0.8 and hadoop, so I have no idea
whether this is expected behaviour or how to fix it.
Thanks and Best Regards,
Jayant
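
To narrow down the hadoop.log check described above, a small script can
list every log line that mentions the plugin id. This is only a sketch:
the helper names and the log path in the usage comment are assumptions,
and it does not rely on any particular Nutch log format, it just does a
substring search for "parse-xml".

```python
def lines_mentioning(log_lines, needle="parse-xml"):
    """Return (line_number, text) pairs for lines that contain needle."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        if needle in line:
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log(path, needle="parse-xml"):
    """Scan a log file on disk; undecodable bytes are replaced, not fatal."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return lines_mentioning(handle, needle)


# Usage (the path is an assumption, adjust it to your install):
#   for number, text in scan_log("logs/hadoop.log"):
#       print(number, text)
```

If this prints nothing at all, the plugin id never appears in the log,
which would be consistent with the plugin not being picked up.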
On 11/5/06, Nutch Newbie <[EMAIL PROTECTED]> wrote:
Can you post your "xmlparser-conf.xml" from the nutch/conf dir ?
Also what kind of error message do you get when you index?
You can use Luke to see the index...
Regards,
On 11/4/06, Jayant Kumar Gandhi <[EMAIL PROTECTED]> wrote:
> Hello Everyone,
>
> I have just installed nutch-0.8.1 on my dev machine. I installed a new
> plugin called XML Parser, available at
> http://issues.apache.org/jira/browse/NUTCH-185
> The issue is that I am unable to get it to work.
> I copied the parse-xml folder to the src/plugin folder. I made the
> corresponding deploy/clean entries in the build.xml file.
>
> Also, I have edited the nutch conf to enable the xml plugin.
> The plugin is still not working. After compiling using ant, I started
> indexing. After the indexing was finished and a query done, I couldn't
> see the indexed fields on the explain page.
>
> Any inputs?
>
> Thanks,
> Jayant
>
--
www.jkg.in | http://www.jkg.in/contact-me/
Jayant Kr. Gandhi
<?xml version="1.0" encoding="UTF-8"?>
<!--
Title : Nutch XMLParser config file
Author : Rida Benjelloun
Email : [EMAIL PROTECTED]
-->
<nutchXmlParser>
<!--List of properties-->
<xmlIndexerProperties type="filePerDocument" namespace="http://purl.org/dc/elements/1.1/">
<field name="dctitle" xpath="//dc:title" type="Text" boost="1.4"/>
<field name="dccreator" xpath="//dc:creator" type="keyword" boost="1.0"/>
</xmlIndexerProperties>
<!--Parse full text if the document does not contain any namespace-->
<xmlIndexerProperties type="filePerDocument" namespace="default">
<field name="xmlcontent" xpath="//*" type="Unstored" boost="1.0"/>
</xmlIndexerProperties>
</nutchXmlParser>
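
As a way to sanity-check the two XPath expressions in the config above
outside of Nutch, here is a minimal Python sketch. The sample document is
hypothetical (it is not the content of xmltest.xml), the function name is
made up for illustration, and ElementTree needs the relative form
".//dc:title" where the config uses "//dc:title". It only shows that the
dc: prefix has to be bound to http://purl.org/dc/elements/1.1/ for the
expressions to match anything.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlIndexerProperties element above.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical test document, NOT the real xmltest.xml.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Hypothetical title</dc:title>
  <dc:creator>Hypothetical author</dc:creator>
</record>"""


def extract_fields(xml_text):
    """Map the config's field names to the text of the matching elements."""
    root = ET.fromstring(xml_text)
    prefixes = {"dc": DC_NS}  # bind the dc: prefix used in the XPath
    return {
        "dctitle": [el.text for el in root.findall(".//dc:title", prefixes)],
        "dccreator": [el.text for el in root.findall(".//dc:creator", prefixes)],
    }


print(extract_fields(SAMPLE))
```

A document whose title element is not in the Dublin Core namespace gives
empty lists here, so it is worth confirming which namespace the test file
actually declares.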
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(xml|text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins.
</description>
</property>
<property>
<name>http.agent.name</name>
<value>KhojGuruBot</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.agent.description</name>
<value>KhojGuruBot</value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value>http://www.khojguru.com/bot.html</value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</description>
</property>
<property>
<name>http.agent.email</name>
<value>[EMAIL PROTECTED]</value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>
<property>
<name>searcher.dir</name>
<value>C:\cygwin\home\HK\nutch-0.8.1\crawl</value>
<description>
Path to root of crawl. This directory is searched (in
order) for either the file search-servers.txt, containing a list of
distributed search servers, or the directory "index" containing
merged indexes, or the directory "segments" containing segment
indexes.
</description>
</property>
</configuration>
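
Since the plugin.includes description above says that any plugin not
matching the expression is excluded, the value can be checked directly
against a plugin directory name. The sketch below assumes the whole name
must match the regular expression (hence re.fullmatch); that reading of
the description, and the function name, are assumptions rather than
something confirmed from the Nutch source.

```python
import re

# Value copied verbatim from the plugin.includes property above.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(xml|text|html|js)|index-basic"
    "|query-(basic|site|url)|summary-basic|scoring-opic"
)


def is_included(plugin_id, pattern=PLUGIN_INCLUDES):
    """True if the whole plugin directory name matches the expression."""
    return re.fullmatch(pattern, plugin_id) is not None


for name in ("parse-xml", "parse-html", "nutch-extensionpoints"):
    print(name, is_included(name))
```

Under that assumption parse-xml does match, so the expression itself is
probably not what keeps the plugin from loading. nutch-extensionpoints
does not match, although the description says it should at least be
included; whether that matters on 0.8.1 is not something this thread
settles.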
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general