Hi guys,
I have a few questions:
1- I found that we have the lib "lib-lucene-analyzers" in the plugin folder.
How does it work? Should I just add "lib-lucene-analyzers" to the list of
plugins in nutch-site.xml, or should I also add language-identifier and
analysis-(fr|de|en)?
2- How do we know the name of the plugin we have to add in nutch-site.xml?
I just added analysis-fr to the list and got an exception saying that it
could not find org.apache.lucene.analyzer.FrenchAnalyzer. It was looking for
a Lucene implementation of the plugin instead of the Nutch implementation,
and I don't know why.
Is there any mapping between the plugin name and a class?
3- I tried to implement an HtmlParseFilter, but there are a few things that I
don't understand.
What is the purpose of a ParseResult? I don't understand why it can hold
several parse results. Is there a specific use case for that?
Why is HtmlParseFilter.filter called only after a first ParseResult has
already been created?
How should I proceed if I want to remove some tags, together with their
content, from the HTML page? Should I parse the page again and create another
ParseResult, which is the only one I will use? For instance, there is some
content I don't want to index: I want to remove the content of every select
box in my HTML page. I thought I could do it in an HtmlParseFilter, but I
noticed that this wastes processing time: the page is parsed once to create a
first ParseResult (which I will never use), and then parsed again (in my
HtmlParseFilter) to get the text content that I actually need to index.
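To make the goal concrete, here is a minimal sketch of the kind of stripping
I have in mind. It uses only the JDK's org.w3c.dom classes and no Nutch
classes at all; the class and method names (SelectStripper, removeSelects,
textWithoutSelects) are made up for illustration, and I am only assuming that
the DOM handed to a filter could be walked the same way.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: drops <select> subtrees from a DOM.
public class SelectStripper {

    /** Removes every select element (and everything inside it) below root, in place. */
    static void removeSelects(Node root) {
        // Snapshot the children first: the NodeList is live, so removing
        // nodes while iterating over it directly would skip siblings.
        NodeList children = root.getChildNodes();
        List<Node> snapshot = new ArrayList<>();
        for (int i = 0; i < children.getLength(); i++) {
            snapshot.add(children.item(i));
        }
        for (Node child : snapshot) {
            if (child.getNodeType() == Node.ELEMENT_NODE
                    && "select".equalsIgnoreCase(child.getNodeName())) {
                root.removeChild(child);
            } else {
                removeSelects(child);
            }
        }
    }

    /** Parses well-formed XHTML and returns its text with select boxes removed. */
    static String textWithoutSelects(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        removeSelects(doc.getDocumentElement());
        return doc.getDocumentElement().getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><p>Keep me</p>"
                + "<select><option>Drop me</option></select></body></html>";
        System.out.println(textWithoutSelects(page));
    }
}
```

What I can't tell is where something like removeSelects should run so that
the text is extracted only once, from the already-stripped tree.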
I may have missed something; if so, I would appreciate your help.
Cheers
E
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general