I have had the same question, and I think there should be a field in the
"Document" object to tell the indexer to skip indexing, but I didn't find one. So I
used a rather crude approach. I hope others can suggest a better method.
1. Return "null" as the Document from the method "filter(Document doc...)"
in your own IndexingFilter.
2. In the method "Indexer.reduce()", add some statements to handle a null doc
right after the statements where the filters are called. The modified code
fragment might look like this:
    try {
      // run indexing filters
      doc = this.filters.filter(doc, parse, (UTF8) key, fetchDatum,
                                inlinks);
    } catch (IndexingException e) {
      if (LOG.isWarnEnabled()) {
        LOG.warn("Error indexing " + key + ": " + e);
      }
      return;
    }
    if (doc == null) {
      if (LOG.isWarnEnabled()) {
        LOG.warn("Skip indexing: " + key);
      }
      return;
    }
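
To show how the two steps fit together, here is a rough, self-contained sketch. It is not Nutch code: the class "RegexSkipSketch", the type "Doc", the pattern "ALLOWED" and the method "index()" are all hypothetical stand-ins I made up for illustration, and the regular expression is only an example. A real filter would implement Nutch's IndexingFilter interface and work on the real Document.

```java
import java.util.regex.Pattern;

class RegexSkipSketch {

    // Hypothetical stand-in for the real Document; it only carries a URL.
    static final class Doc {
        final String url;

        Doc(String url) {
            this.url = url;
        }
    }

    // Example pattern (an assumption): keep only URLs ending in .pdf, .htm or .html.
    private static final Pattern ALLOWED =
            Pattern.compile(".*\\.(pdf|html?)$", Pattern.CASE_INSENSITIVE);

    // Step 1: the filter returns null to say "do not index this document".
    static Doc filter(Doc doc) {
        return ALLOWED.matcher(doc.url).matches() ? doc : null;
    }

    // Step 2: the caller checks for null right after running the filter.
    // Returns true if the document would be indexed, false if it is skipped.
    static boolean index(Doc doc) {
        Doc filtered = filter(doc);
        if (filtered == null) {
            System.out.println("Skip indexing: " + doc.url);
            return false;
        }
        System.out.println("Indexing: " + filtered.url);
        return true;
    }

    public static void main(String[] args) {
        index(new Doc("http://example.com/report.pdf"));
        index(new Doc("http://example.com/photo.jpg"));
    }
}
```

The null return acts as the "skip" signal only because the caller checks for it, so without the change in step 2 a null document would probably just fail later in the indexer.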
----- Original Message -----
From: "Tobias Zahn" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Tuesday, January 30, 2007 2:57 AM
Subject: 'RegexIndexingFilter'
> Good evening!
> I have found out that it is impossible to index only some specific file
> types with nutch. Needing this feature, I thought of implementing an
> 'RegexIndexingFilter', if that would be the right thing to do so.
> I have read some sourcecode, but I couldn't find out how to tell the
> indexer that he shouldn't index a file.
>
> Hoping that I am on the right way I hope for your opinions, ideas and
> your help.
>
> TIA,
> Tobias Zahn
>