Hi all,

I am developing a search engine for a governmental body. This search
engine has to index plain XML documents that follow a custom XML
schema. The XML documents contain information about laws and official
announcements for Andalusia.

I need to implement different filters for the search. The current search
engine, which can be found here [1], would need to be extended with
filters on organizational body, kind of announcement (law, resolution,
...), and so on.
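
To illustrate the kind of filtering meant here, a minimal sketch in
plain Python (this is not Nutch or Lucene code). The field names, the
helpers matches() and search(), and the third sample record are
assumptions made up for the example:

```python
# Hypothetical flat records, one per disposition. The first two mirror the
# sample XML in this post; the third is invented so the filters have
# something to exclude.
RECORDS = [
    {"organisation": "Consejería de la Presidencia",
     "type": "Decreto", "year": 2006},
    {"organisation": "Consejería de Economia y Hacienda",
     "type": "Resolución", "year": 2006},
    {"organisation": "Consejería de la Presidencia",
     "type": "Resolución", "year": 2005},
]


def matches(record, organisation=None, kind=None, years=None):
    """Return True if the record passes every filter that is set."""
    if organisation is not None and record["organisation"] != organisation:
        return False
    if kind is not None and record["type"] != kind:
        return False
    if years is not None:
        first, last = years  # inclusive year range
        if not first <= record["year"] <= last:
            return False
    return True


def search(records, **filters):
    """Keep only the records that satisfy all given filters."""
    return [r for r in records if matches(r, **filters)]


hits = search(RECORDS, kind="Resolución", years=(2006, 2006))
for hit in hits:
    print(hit["organisation"], hit["type"], hit["year"])
```

In a real index the same conditions would be expressed as queries on
indexed fields instead of a linear scan over a list.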

I played a bit with Nutch 0.8 and am asking myself whether it is the
best tool for the task. I got Nutch to index the XML documents and I can
also search the index, but I would need to add filter conditions to the
search. The alternative I see would be pure Lucene, since I am not
really "crawling" the site: the documents are not linked with each
other, so I simply list all the files that have to be indexed in the
urls/bulletin file.

To give you a better impression of the underlying architecture and XML
documents: each weekday there is a new bulletin (approx. 100-200
pages), e.g. [2]. This bulletin is stored on the file system and needs
to be indexed.

We have two different document types: summaries and dispositions. A
summary looks like this:
<summary year="2006" number="209" date="27-10-2006" section="1"
  startPage="8" endPage="20">
  <title>1. DISPOSICIONES GENERALES</title>
  <organisation name="Consejería de la Presidencia">
    <disposition bojaYear="2006" bojaNumber="209"
      bojaSection="1" type="Decreto" startPage="8" endPage="10"
      date="10-11-2006" detail="999952" law="178/2006"> Decreto
      178/2006, de 10 de octubre, por el que se establecen normas de
      protección de la avifauna para las instalaciones eléctricas de
      alta tensión</disposition>
  </organisation>
  <organisation name="Consejería de Economia y Hacienda">
    <disposition bojaYear="2006" bojaNumber="209"
      bojaSection="1" type="Resolución" startPage="10"
      endPage="12" date="10-11-2006" detail="999961">
      Resolución de 10 de octubre de 2006, de la Dirección General de
      Tesorería y Deuda Pública, por la que se realiza una
      convocatoria de subasta de carácter ordinario dentro del
      Programa de Emisión de Bonos y Obligaciones de la Junta de
      Andalucía.</disposition>
  </organisation>
</summary>

Reading the wiki and the documentation, I get the impression that I need
to write my own implementation of an indexer/searcher plugin which is
able to extract and index the crucial filter information, such as
<summary year="2006" number="209" date="27-10-2006" section="1">,
<organisation name="Consejería de Economia y Hacienda"> and
<disposition type="Resolución">.
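
As an illustration of that extraction step only, here is a minimal
sketch using Python's standard xml.etree.ElementTree rather than the
Nutch plugin API. The record keys and the function name
extract_records() are assumptions, and the sample is a shortened copy
of the summary shown above:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the summary document from this post.
SAMPLE = """<summary year="2006" number="209" date="27-10-2006" section="1"
  startPage="8" endPage="20">
  <title>1. DISPOSICIONES GENERALES</title>
  <organisation name="Consejería de la Presidencia">
    <disposition bojaYear="2006" bojaNumber="209" bojaSection="1"
      type="Decreto" startPage="8" endPage="10" date="10-11-2006"
      detail="999952" law="178/2006"> Decreto
      178/2006, de 10 de octubre</disposition>
  </organisation>
  <organisation name="Consejería de Economia y Hacienda">
    <disposition bojaYear="2006" bojaNumber="209" bojaSection="1"
      type="Resolución" startPage="10" endPage="12" date="10-11-2006"
      detail="999961">
      Resolución de 10 de octubre de 2006</disposition>
  </organisation>
</summary>"""


def extract_records(xml_text):
    """Flatten a summary into one dict of filter fields per disposition."""
    root = ET.fromstring(xml_text)
    records = []
    for org in root.iter("organisation"):
        for disp in org.iter("disposition"):
            records.append({
                # Attributes of the enclosing <summary> element.
                "year": root.get("year"),
                "number": root.get("number"),
                "bulletin_date": root.get("date"),
                "section": root.get("section"),
                # Attribute of the enclosing <organisation> element.
                "organisation": org.get("name"),
                # Attributes and text of the <disposition> itself.
                "type": disp.get("type"),
                "detail": disp.get("detail"),
                "text": " ".join((disp.text or "").split()),
            })
    return records


records = extract_records(SAMPLE)
for record in records:
    print(record["organisation"], "|", record["type"], "|", record["text"])
```

Each resulting record corresponds to what would become one indexed
document with separate, filterable fields.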

Still being a newbie to Nutch, I would appreciate the opinion of
experienced devs on whether Nutch is the right choice and, if so, how I
should start.

TIA for any information.

Regards

[1] http://andaluciajunta.es/portal/aj-bojaBuscador/0,22815,,00.html
[2] http://andaluciajunta.es/portal/boletines/2006/11/aj-bojaVerPagina-2006-11/0,23167,bi%253D693228039889,00.html


_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
