Re: PHP-Lucene Integration

2005-04-05 Thread Giovanni Novelli
As Lucene's native language is Java, it would be more natural to access its functionality through JSP; still, the idea of accessing Lucene from PHP seems interesting, as PHP is perhaps the most widely deployed server-side scripting language. I think that the way to provide access to Lucene AP

Text extraction from HTML

2005-07-29 Thread Giovanni Novelli
d the HtmlParser that comes with Nutch, but I wasn't able to make it work without adjusting Nutch's global XML configuration; perhaps that's the only way to make such a plugin work? Does Lucene offer any good HTML parser in the contrib section for parsing web pages found in the wild? Best regards, G

Re: Text extraction from HTML

2005-07-29 Thread Giovanni Novelli
I have tried both HtmlParser v1.5 and NekoHTML. With the former, my implementation doesn't work: for example, it picks up text from JavaScript blocks. I followed the hint from http://htmlparser.sourceforge.net/javadoc/org/htmlparser/visitors/TextExtractingVisitor.html The following is my NOT working implement
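
For illustration only, here is a minimal sketch of the behaviour being asked for: visible text with the contents of script and style blocks left out. It is not the implementation mentioned in the message, and it does not use HtmlParser, NekoHTML, or any Lucene class. It assumes plain JDK regular expressions are good enough for a demonstration; the class name PlainTextSketch and the method extractText are hypothetical. Regex stripping is fragile on malformed pages found in the wild and does not decode entities such as &amp;amp;, so a real parser is likely the better choice for production use.

```java
import java.util.regex.Pattern;

/** Hypothetical helper; not part of HtmlParser, NekoHTML, Nutch, or Lucene. */
final class PlainTextSketch {
    // Whole <script>...</script> and <style>...</style> elements, bodies included.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // HTML comments, which may also hide script code.
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    // Any remaining tag.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Returns the text between tags, skipping script, style, and comment bodies. */
    static String extractText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Collapse the runs of spaces left behind by the removed markup.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><script>var x = 1;</script></head>"
                + "<body><p>Hello <b>Lucene</b></p></body></html>";
        System.out.println(extractText(html)); // prints "Hello Lucene"
    }
}
```

The order matters: script and style elements are removed as whole units before the generic tag pattern runs, otherwise their bodies would survive as text, which is the symptom described above.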

Re: What is stemming?

2005-11-20 Thread Giovanni Novelli
[AFAIK] Lucene's stemming is based on Snowball (http://snowball.tartarus.org/), and Snowball includes an implementation of Porter's algorithm (http://www.tartarus.org/~martin/PorterStemmer/), so, if I'm not wrong, you should refer to them.
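
To give a rough feel for what a stemmer does, here is a small sketch of step 1a only of the original Porter algorithm, the step that handles plural endings. It is not the full algorithm and not Lucene's or Snowball's code; the class name PorterStep1aSketch and the method step1a are hypothetical, and input is assumed to be already lower-cased.

```java
/** Hypothetical illustration of a single Porter step; not Lucene or Snowball code. */
final class PorterStep1aSketch {
    /** Applies step 1a of the original Porter algorithm to a lower-case word. */
    static String step1a(String word) {
        // Rules are tried in this order; the first suffix that matches wins.
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2); // sses -> ss
        }
        if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 2); // ies -> i
        }
        if (word.endsWith("ss")) {
            return word;                                 // ss -> ss (unchanged)
        }
        if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1); // s -> (removed)
        }
        return word;
    }

    public static void main(String[] args) {
        String[] words = {"caresses", "ponies", "caress", "cats"};
        for (String w : words) {
            System.out.println(w + " -> " + step1a(w));
        }
    }
}
```

Note that the result need not be a dictionary word ("ponies" becomes "poni"); the point of stemming is that related forms map to the same index term, so a query and a document that use different inflections can still match.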