Re: [Nutch-general] Indexing xml documents on local file system

Chris Mattmann Mon, 27 Nov 2006 09:35:08 -0800

Hi Thorsten,

On 11/27/06 4:00 AM, "Thorsten Scherler"
<[EMAIL PROTECTED]> wrote:


> 
> Reading the wiki and the docu I get the impression I need to write my
> own implementation of an indexer/searcher plugin, which is able to
> filter/index crucial filter information such as <summary year="2006"
> number="209" date="27-10-2006" section="1">, <organisation
> name="Consejería de Economia y Hacienda"> and <disposition
> type="Resolución" >.

Yes, you may need to write your own parse, indexing and searcher plugins.
However, I am currently working on getting the parse-xml plugin into the
Nutch sources. The parse-xml plugin includes an indexing filter for the
fields that are extracted by the XML parser. The XML parser can be
configured for custom schemas and for the fields that need to be extracted.
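
As a rough illustration of the kind of field extraction described above,
here is a minimal sketch using only the JDK's standard DOM and XPath classes.
It is not the parse-xml plugin's code and does not use the Nutch plugin API.
The class name, the field-to-XPath mapping and the <doc> wrapper element are
assumptions made up for the example; only the summary, organisation and
disposition elements come from the question.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical stand-alone sketch; not part of Nutch or the parse-xml plugin.
public class XmlFieldSketch {

    // Assumed mapping from index field name to the XPath that supplies it.
    static final Map<String, String> FIELD_XPATHS = new LinkedHashMap<>();
    static {
        FIELD_XPATHS.put("year", "//summary/@year");
        FIELD_XPATHS.put("number", "//summary/@number");
        FIELD_XPATHS.put("date", "//summary/@date");
        FIELD_XPATHS.put("section", "//summary/@section");
        FIELD_XPATHS.put("organisation", "//organisation/@name");
        FIELD_XPATHS.put("disposition", "//disposition/@type");
    }

    // Sample input; the <doc> root element is an assumption.
    // Accented characters are written as Unicode escapes.
    static final String SAMPLE =
        "<doc>"
        + "<summary year=\"2006\" number=\"209\" date=\"27-10-2006\" section=\"1\"/>"
        + "<organisation name=\"Consejer\u00eda de Economia y Hacienda\"/>"
        + "<disposition type=\"Resoluci\u00f3n\"/>"
        + "</doc>";

    // Parses the XML once, then evaluates each configured XPath against it.
    // Fields whose XPath matches nothing are left out of the result.
    static Map<String, String> extractFields(String xml) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : FIELD_XPATHS.entrySet()) {
            String value = xpath.evaluate(e.getValue(), dom);
            if (!value.isEmpty()) {
                fields.put(e.getKey(), value);
            }
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        // Print each extracted field as name=value.
        for (Map.Entry<String, String> e : extractFields(SAMPLE).entrySet()) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```

In a real plugin, a map like this would be handed to an indexing filter so
that each entry becomes a searchable field. How that hand-off looks depends
on the Nutch version, so it is left out here.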

The plugin is currently available in JIRA, attached to this issue:

http://issues.apache.org/jira/browse/NUTCH-185

I am working hard to get this plugin ported to the latest trunk source and
ready to be committed. I hope to attach a patch within the next week that
brings the plugin up to date and gets the code ready for prime time
(formatting, public javadocs, etc.). Once I attach the patch, you may find
that you only need to write your searcher plugin. Then again, in the
interest of time, you may prefer to write your own set of plugins. In that
case, you can find examples of parse, index and query plugins in the Nutch
source in Subversion, available here:

Parse plugins: 
http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/plugin/parse-*
Index plugins: 
http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/plugin/index-*
Query plugins: 
http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/plugin/query-*


> 
> Still being a newbie to nutch I would appreciate the opinion of
> experienced devs whether nutch is the right choice and if so how I
> should start. 

I think you could do this with Nutch, and if you do, you get the following
for free:

Crawling
Parsing/Indexing
Search Webapp, and RSS based OpenSearch servlet

You could also do this with Lucene alone, but I think you would end up
maintaining more code and rewriting functionality that Nutch already
provides.

Just my 2 cents...

Cheers,
  Chris


______________________________________________
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.


