Re: HTML parser

2002-04-19 Thread Brian Goetz
> While trying to research the same thing, I found the following...here's a > good example of link extraction. Try http://www.quiotix.com/opensource/html-parser It's easy to write a Visitor which extracts the links; should take about ten lines of code. -- Brian Goetz Quiotix Corporation [

Re: HTML parser

2002-04-19 Thread Erik Hatcher
HttpUnit (which uses JTidy under the covers) makes child's play out of pulling out links and navigating to them. The only caveat (and this would be true for practically all tools, I suspect) is that the HTML has to be relatively well-formed for it to work well. JTidy can be somewhat forgiving tho

Re: HTML parser

2002-04-19 Thread David Black
While trying to research the same thing, I found the following...here's a good example of link extraction. http://developer.java.sun.com/developer/TechTips/1999/tt0923.html It seems like I could use this to also get the text out from between the tags but haven't been able to do it yet. It

RE: Wildcard Searching

2002-04-19 Thread Otis Gospodnetic
Did the change that you mentioned below really work for you? I wrote this class: http://nagoya.apache.org/bugzilla/showattachment.cgi?attach_id=1638 and it looks like the bug is not in QueryParser, but in some Java class (could it be WildcardTermEnum?), since the class does not make use of QueryP

RE: Removing a write.lock file

2002-04-19 Thread Armbrust, Daniel C.
Since I am far from a Lucene expert, I would suggest searching the archive for write.lock; I'm sure this has been discussed before. My belief is that when you call the close method to close your index writer, it removes the write.lock file. Someone please verify/correct me if I'm wrong. --

RE: HTML parser

2002-04-19 Thread Otis Gospodnetic
Such classes are not included with Lucene. This was _just_ mentioned on this list earlier today. Look at the archives and search for crawler, URL, lucene sandbox, etc. Otis --- Ian Forsyth <[EMAIL PROTECTED]> wrote: > > Are there core classes part of lucene that allow one to feed lucene > links

Re: Wildcard query problem with "?"

2002-04-19 Thread Otis Gospodnetic
Hm, I just went through all the diffs after RC2 (QueryParser.jj revision 1.3) and I can't see where '?' was dropped. However, one user reported this on February 27th: We just tried adding the "?" character to QueryParser.jj under <#_TERM_START_CHAR>. We noticed that the "*" was in that list, so w

RE: HTML parser

2002-04-19 Thread Ian Forsyth
Are there core classes in Lucene that allow one to feed Lucene links, so that it captures the contents of those URLs into the index? Or does one write a file-capture class to seek out each URL, store the file in a directory, and then index the local directory? Ian -Original Message

Re: HTML parser

2002-04-19 Thread Terence Parr
Hi Otis, The idea behind stripHTML is pretty simple. It's just a hand-built lexer that looks like this:

    while more char
        if comment start, scarf til end comment
        if char is < then
            if SCRIPT tag scarf til end SCRIPT; [same with A, STYLE, HEA
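For readers who want to see the shape of such a scanner, here is a minimal sketch in Java. It is not the stripHTML code from this thread, which is not shown in the archive. The class and method names (StripHtmlSketch, strip, startsTag, skipPast) are made up for illustration, and it only special-cases comments, SCRIPT and STYLE; the rest of the tag list in the message above is cut off, so it is not guessed at here. Every other tag is simply dropped and replaced with a space.

```java
class StripHtmlSketch {

    /** Returns the text of an HTML string with comments, SCRIPT/STYLE bodies and tags removed. */
    static String strip(String html) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        int n = html.length();
        while (i < n) {
            if (html.startsWith("<!--", i)) {
                // Comment start: skip everything up to and including the end of the comment.
                i = skipPast(html, "-->", i + 4);
            } else if (startsTag(html, "script", i)) {
                // SCRIPT tag: skip the whole element, body included.
                i = skipPast(html, "</script>", i);
            } else if (startsTag(html, "style", i)) {
                // STYLE tag: same treatment as SCRIPT.
                i = skipPast(html, "</style>", i);
            } else if (html.charAt(i) == '<') {
                // Any other tag: drop it and leave a space so adjacent words stay separated.
                i = skipPast(html, ">", i + 1);
                out.append(' ');
            } else {
                out.append(html.charAt(i));
                i++;
            }
        }
        // Collapse runs of whitespace left behind by the removed markup.
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    /** True if an opening tag with the given name starts at index i (case-insensitive). */
    static boolean startsTag(String s, String name, int i) {
        String open = "<" + name;
        if (!s.regionMatches(true, i, open, 0, open.length())) {
            return false;
        }
        int end = i + open.length();
        // Reject longer names that merely share the prefix.
        return end >= s.length() || !Character.isLetterOrDigit(s.charAt(end));
    }

    /** Index just past the next case-insensitive occurrence of marker, or s.length() if absent. */
    static int skipPast(String s, String marker, int from) {
        for (int i = from; i + marker.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, marker, 0, marker.length())) {
                return i + marker.length();
            }
        }
        return s.length();
    }
}
```

Calling StripHtmlSketch.strip(page) on a page held in a String yields plain text that could then be handed to an indexer. The sketch does not decode character entities such as &amp;amp; and does not try to recover from malformed markup beyond running to the end of the input.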

PorterStemmer default access

2002-04-19 Thread Paul Dlug
Is there any reason that org.apache.lucene.analysis.PorterStemmer has default access instead of being public? I wanted to use it in a custom filter instead of using PorterStemFilter but couldn't because of its access.

RE: HTML parser

2002-04-19 Thread Mark Ayad
You can use the Swing HTML parser to do this, but it's only a 3.2 DTD based parser. I have written (attached) a total hack job for breaking up an HTML page into its component parts; the code gives you an idea ... If anyone wants to know how to use the Swing-based parser, should I add some code? Mark
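As a rough illustration of the Swing-based approach, the sketch below uses the JDK's javax.swing.text.html.parser.ParserDelegator with an HTMLEditorKit.ParserCallback to collect both the href values of A tags and the text between tags. It is not the attached code mentioned above, which is not part of this archive. The names SwingHtmlSketch, Collector and collect are invented for the example, and the parser works from an HTML 3.2 DTD, so newer markup may be handled loosely.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class SwingHtmlSketch {

    /** Callback that remembers link targets and text as the parser reports them. */
    static class Collector extends HTMLEditorKit.ParserCallback {
        final List<String> links = new ArrayList<>();
        final StringBuilder text = new StringBuilder();

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (tag == HTML.Tag.A) {
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href != null) {
                    links.add(href.toString());
                }
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            // Called for runs of character data between tags.
            text.append(data).append(' ');
        }
    }

    /** Parses the given HTML string and returns the filled collector. */
    static Collector collect(String html) {
        Collector collector = new Collector();
        try {
            // The third argument tells the parser to ignore any charset declared in the document.
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector;
    }
}
```

After calling SwingHtmlSketch.collect(page), the links field holds the raw href values in document order (relative URLs are not resolved), and text holds the character data, which could be fed to an indexer. Resolving relative links and fetching pages are left out of the sketch.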

Re: Rationale for having boolean operators as ALL CAPS

2002-04-19 Thread Peter Carlson
Thanks Brian. On 4/18/02 4:20 PM, "Brian Goetz" <[EMAIL PROTECTED]> wrote: > >> Can someone tell me the rationale for having the boolean operator only work >> if they are all caps? > > I can, since I was the one who made this decision. > > Most queries are entered in lower or mixed case. Tr

Re: Some questions - Analyzer

2002-04-19 Thread Rosen Marinov
See the Lucene Official FAQ, Question 17: 17. Can I write my own custom analyzer ? Sure. An analyzer is basically a factory object that creates a TokenStream object used to tokenize the text. A typical analyzer implementation creates the TokenStream by creating a standard tokenizer and combining it

Re: Some questions

2002-04-19 Thread Marco Ferrante
> ... snip ... > I think that there isn't any Italian Analyzer, is there? > How can I write one? > > ... snip ... I'm interested in this contribution too. Can I help? -- Marco Ferrante ([EMAIL PROTECTED]) CSITA (Centro Servizi Informatici e Telematici

Re: Some questions

2002-04-19 Thread Karl Øie
> Well, I saw that Lucene creates the index on the filesystem: I think > that this is a problem for a production environment. I usually use a > database, for example Oracle. > Is it possible to integrate Lucene with Oracle or some other DB (MySQL)? You can store the index in blob fields, but that's about

Some questions

2002-04-19 Thread [EMAIL PROTECTED]
Hi all, my name is Laura and I'm a new member of this list. I'm a longtime user of Tomcat and I'm also a member of the Tomcat user list. Yesterday, looking at the Jakarta menu, I saw Lucene and I said: "What is this?" Reading the Lucene home page I understood that Lucene is a very interesting and