Re: [Nutch-general] XMLParser for Nutch

Jayant Kumar Gandhi Sat, 04 Nov 2006 23:19:12 -0800

I am using the default xmlparser-conf.xml; I just copied it into the
nutch/conf dir. To test it I used xmltest.xml from the sample
directory, which I have also uploaded to http://www.jkg.in/xmltest.xml


I do not get any errors while indexing or parsing. The crawl log is
attached. The xml file shows up in the results when I search for
'XPath', but when I click the explain link, the dctitle field is not
shown in the index, which it should be.

I just noticed that hadoop.log reports an error when handling xml files,
and I cannot see parse-xml being loaded, even though I have it enabled
in my nutch-site.xml. I am new to nutch-0.8 and hadoop, so I have no
idea whether this is expected behaviour or how to fix it.

Thanks and Best Regards,
Jayant

On 11/5/06, Nutch Newbie <[EMAIL PROTECTED]> wrote:

Can you post your "xmlparser-conf.xml" from the nutch/conf dir?
Also, what kind of error message do you get when you index?
You can use Luke to inspect the index.

Regards,

On 11/4/06, Jayant Kumar Gandhi <[EMAIL PROTECTED]> wrote:
> Hello Everyone,
>
> I have just installed nutch-0.8.1 on my dev machine. I installed a new
> plugin called XML Parser, available at
> http://issues.apache.org/jira/browse/NUTCH-185
> The issue is that I am unable to get it to work.
> I copied the parse-xml folder to the src/plugin folder. I made the
> corresponding deploy/clean entries in the build.xml file.
>
> Also, I have edited the nutch conf to enable the xml plugin.
> The plugin is still not working. After compiling using ant, I started
> indexing. After the indexing was finished and a query done, I couldn't
> see the indexed fields on the explain page.
>
> Any inputs?
>
> Thanks,
> Jayant
>


--
www.jkg.in | http://www.jkg.in/contact-me/
Jayant Kr. Gandhi

<?xml version="1.0" encoding="UTF-8"?>
<!--
	Title : Nutch XMLParser config file
	Author : Rida Benjelloun
	Email : [EMAIL PROTECTED]
-->
	
<nutchXmlParser>
	<!--List of properties-->
	<xmlIndexerProperties type="filePerDocument" namespace="http://purl.org/dc/elements/1.1/">
		<field name="dctitle" xpath="//dc:title" type="Text" boost="1.4"/>
		<field name="dccreator" xpath="//dc:creator" type="keyword" boost="1.0"/>
	</xmlIndexerProperties>
	
	<!--Parse full text if the document does not contain any namespace-->
	<xmlIndexerProperties type="filePerDocument" namespace="default">
		<field name="xmlcontent" xpath="//*" type="Unstored" boost="1.0"/>		
	</xmlIndexerProperties>
</nutchXmlParser>
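
For anyone who wants to check the field mapping above outside Nutch, here
is a minimal sketch in Python. It is not the plugin's code: the sample
document is made up (it is NOT the real xmltest.xml), and the helper name
extract is hypothetical. It only shows that an XPath such as //dc:title
finds something when the dc prefix is bound to exactly the namespace URI
used by the document, and finds nothing when the URI differs.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlIndexerProperties element above.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical sample document, used only for illustration.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>XPath basics</dc:title>
  <dc:creator>Jane Doe</dc:creator>
</record>"""


def extract(doc_xml, xpath, namespaces):
    """Return the text of every element matched by xpath.

    ElementTree does not accept absolute paths on an element, so the
    config's "//dc:title" is written here as ".//dc:title".
    """
    root = ET.fromstring(doc_xml)
    return [el.text for el in root.findall(xpath, namespaces)]


# Prefix bound to the same URI the document uses: the title is found.
titles = extract(SAMPLE, ".//dc:title", {"dc": DC_NS})

# Prefix bound to a URI that differs by one character: nothing matches.
no_titles = extract(SAMPLE, ".//dc:title", {"dc": DC_NS.rstrip("/")})
```

If the second call is what the parser effectively does (for example
because the namespace string in the config and in the document are not
identical), the dctitle field would simply come out empty.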

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(xml|text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

<property>
  <name>http.agent.name</name>
  <value>KhojGuruBot</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

	http.robots.agents
	http.agent.description
	http.agent.url
	http.agent.email
	http.agent.version

  and set their values appropriately.

  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>KhojGuruBot</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://www.khojguru.com/bot.html</value>
  <description>A URL to advertise in the User-Agent header.  This will 
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>[EMAIL PROTECTED]</value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

<property>
  <name>searcher.dir</name>
  <value>C:\cygwin\home\HK\nutch-0.8.1\crawl</value>
  <description>
  Path to root of crawl.  This directory is searched (in
  order) for either the file search-servers.txt, containing a list of
  distributed search servers, or the directory "index" containing
  merged indexes, or the directory "segments" containing segment
  indexes.
  </description>
</property>

</configuration>
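
A quick way to sanity-check the plugin.includes value above is to test
plugin ids against it by hand. The sketch below is in Python and rests on
an assumption: that Nutch keeps a plugin only when its id (as declared in
the plugin's plugin.xml) matches the whole regular expression. The helper
names get_property and included are hypothetical, and the config string is
a trimmed copy of the file above.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed copy of the nutch-site.xml quoted above (one property only).
NUTCH_SITE = """<?xml version="1.0"?>
<configuration>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(xml|text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
</property>
</configuration>
"""


def get_property(conf_xml, name):
    """Return the <value> of the named <property>, or None if absent."""
    root = ET.fromstring(conf_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


def included(plugin_ids, includes_regex):
    """Map each plugin id to True if the whole id matches the regex.

    Full-match semantics are an assumption about how Nutch filters ids.
    """
    pattern = re.compile(includes_regex)
    return {pid: pattern.fullmatch(pid) is not None for pid in plugin_ids}


result = included(
    ["parse-xml", "parse-html", "parse-pdf"],
    get_property(NUTCH_SITE, "plugin.includes"),
)
```

With this value, parse-xml and parse-html match and parse-pdf does not, so
the regular expression itself is not what keeps parse-xml from loading.
The id in the plugin's own plugin.xml would be the next thing to compare.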

