Hi all,
I am developing a search engine for a governmental body. This search
engine has to index plain XML documents that follow a custom XML schema.
The XML documents contain information about laws and official
announcements for Andalusia.
I need to implement several filters for the search. The current search
engine, which can be found at [1], would need to be extended with
restrictions on organizational body, kind of announcement (law,
resolution, ...), and so on.
I have played a bit with Nutch 0.8 and am wondering whether it is the
best tool for the task. I got Nutch to index the XML documents and I can
search the index, but I would need to add filter conditions to the
search. The alternative I see is plain Lucene, since I am not really
"crawling" the site: the documents are not linked to each other, so I
simply list all the files that have to be indexed in the urls/bulletin
file.
To give you a better impression of the underlying architecture and XML
documents: each weekday there is a new bulletin (approx. 100 to 200
pages), e.g. [2]. This bulletin is stored on the file system and needs
to be indexed.
We have two different document types: summaries and dispositions. A
summary looks like this:
<summary year="2006" number="209" date="27-10-2006" section="1"
         startPage="8" endPage="20">
  <title>1. DISPOSICIONES GENERALES</title>
  <organisation name="Consejería de la Presidencia">
    <disposition bojaYear="2006" bojaNumber="209"
                 bojaSection="1" type="Decreto" startPage="8" endPage="10"
                 date="10-11-2006" detail="999952" law="178/2006"> Decreto
      178/2006, de 10 de octubre, por el que se establecen normas de
      protección de la avifauna para las instalaciones eléctricas de
      alta tensión</disposition>
  </organisation>
  <organisation name="Consejería de Economia y Hacienda">
    <disposition bojaYear="2006" bojaNumber="209"
                 bojaSection="1" type="Resolución" startPage="10"
                 endPage="12" date="10-11-2006" detail="999961">
      Resolución de 10 de octubre de 2006, de la Dirección General de
      Tesorería y Deuda Pública, por la que se realiza una
      convocatoria de subasta de carácter ordinario dentro del
      Programa de Emisión de Bonos y Obligaciones de la Junta de
      Andalucía.</disposition>
  </organisation>
</summary>
From reading the wiki and the documentation, I get the impression that I
need to write my own indexer/searcher plugin that extracts and indexes
the crucial filter information, such as <summary year="2006"
number="209" date="27-10-2006" section="1">, <organisation
name="Consejería de Economia y Hacienda"> and <disposition
type="Resolución">.
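
To make clearer which values I mean, here is a rough sketch of the
extraction step using only the JDK's DOM parser. It is not Nutch or
Lucene code. The class name SummaryFields and the map keys (year,
number, date, section, organisation, type, text) are names I made up for
illustration. Each returned map corresponds to what I imagine would
become the fields of one indexed document.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: collects the filter values from one summary file.
public class SummaryFields {

    // Returns one map of field name -> value per <disposition> element.
    public static List<Map<String, String>> extract(InputStream in) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(in);
        Element summary = doc.getDocumentElement();

        List<Map<String, String>> records = new ArrayList<>();
        NodeList orgs = summary.getElementsByTagName("organisation");
        for (int i = 0; i < orgs.getLength(); i++) {
            Element org = (Element) orgs.item(i);
            NodeList dispositions = org.getElementsByTagName("disposition");
            for (int j = 0; j < dispositions.getLength(); j++) {
                Element disposition = (Element) dispositions.item(j);
                Map<String, String> record = new LinkedHashMap<>();
                // Attributes of the enclosing <summary> element.
                record.put("year", summary.getAttribute("year"));
                record.put("number", summary.getAttribute("number"));
                record.put("date", summary.getAttribute("date"));
                record.put("section", summary.getAttribute("section"));
                // Name of the enclosing <organisation> element.
                record.put("organisation", org.getAttribute("name"));
                // Kind of announcement, e.g. "Decreto".
                record.put("type", disposition.getAttribute("type"));
                // Full text, with line breaks and indentation collapsed.
                record.put("text",
                        disposition.getTextContent().trim().replaceAll("\\s+", " "));
                records.add(record);
            }
        }
        return records;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java SummaryFields <summary.xml>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            for (Map<String, String> record : extract(in)) {
                System.out.println(record);
            }
        }
    }
}
```

Run against the summary above, this should print one record per
disposition. What I do not know is where such a step belongs: in a Nutch
indexing plugin, or in a small standalone Lucene indexer.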
Since I am still new to Nutch, I would appreciate the opinion of
experienced devs on whether Nutch is the right choice and, if so, how I
should start.
TIA for any information.
salu2
[1] http://andaluciajunta.es/portal/aj-bojaBuscador/0,22815,,00.html
[2] http://andaluciajunta.es/portal/boletines/2006/11/aj-bojaVerPagina-2006-11/0,23167,bi%253D693228039889,00.html