Hi
The HTML parser is simple: it walks through a page and extracts the content of tags.
So your task is to take the first (or second) h1 tag and extract its
content into a separate field.
You can look at DOMContentUtils.getTextHelper() in the parse-html plugin; it
extracts text from tags.
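To make the h1 step concrete, here is a minimal sketch of the tree walk using only the standard org.w3c.dom API. The class FirstHeading and both method names are made up for illustration and are not part of Nutch. The demo parses a small well-formed XHTML string with the JDK's XML parser, on the assumption that inside the plugin you would instead pass the DOM node the HTML parser has already built. Element names may come back in upper case depending on the HTML parser, so the comparison ignores case.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: finds the first <h1> in a DOM tree.
class FirstHeading {

    // Depth-first walk in document order.
    // Returns the whitespace-normalized text of the first <h1>, or null if there is none.
    static String firstH1Text(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "h1".equalsIgnoreCase(node.getNodeName())) {
            // getTextContent() concatenates the text of all descendant nodes,
            // so nested tags such as <b> inside the heading are included.
            return node.getTextContent().replaceAll("\\s+", " ").trim();
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            String found = firstH1Text(children.item(i));
            if (found != null) {
                return found; // stop at the first match, ignore later <h1> tags
            }
        }
        return null;
    }

    // Demo-only convenience: builds a DOM from well-formed XHTML with the JDK parser.
    // (Assumption: in the plugin the DOM already exists, so this step is not needed there.)
    static String firstH1TextFromXhtml(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            return firstH1Text(doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Home</title></head><body>"
                + "<h1>Annual <b>Report</b> 2009</h1><h1>Second</h1></body></html>";
        System.out.println(firstH1TextFromXhtml(page)); // prints "Annual Report 2009"
    }
}
```

The returned string is what you would then store as the new field value (newtitle in the metadata call).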
Then add the new field to the metadata:
metadata.add("mytitle", newtitle);
In the indexer you need to declare the new metadata field.
In Nutch 1.0 this is done in the addIndexBackendOptions function of an indexing
plugin (index-basic, index-more):
LuceneWriter.addFieldOptions("mytitle", LuceneWriter.STORE.YES,
LuceneWriter.INDEX.TOKENIZED, conf);
During the indexing phase, pick up the field from the metadata
where you put it and store it in Lucene.
Then you can implement searching.
Best Regards
Alexander Aristov
2009/5/27 Felix Zimmermann <[email protected]>
> Hi,
>
> a large share of the sites I crawl do not have any meaningful text inside
> of <title></title>. In contrast, the <h1></h1> often has much more
> detailed information about the content of the page.
>
> How can I get the content of (only) the first <h1></h1> element into
> a separate field, called e.g. "alttitle", in order to index it later?
> I've given up trying to understand HtmlParser.java.
>
> For the indexing part, I know how to create a plugin (e.g. based on
> index-more). My big problem is writing a custom parse plugin or
> modifying the HtmlParser. Unfortunately, all previous hints that I found
> on this mailing list assume that I am an expert in Java. I think I need
> somewhat more detailed help ;-)
>
> Thanks for any help,
> Felix.
>
>