Re: demo HTML parser question
On Thu, 23 Sep 2004 10:53:26 -0700, Doug Cutting wrote
> [EMAIL PROTECTED] wrote:
> > We were originally attempting to use the demo html parser (Lucene 1.2), but as
> > you know, it's for a demo. I think it's threaded to optimize on time, to allow
> > the calling thread to grab the title or top message even though it's not done
> > parsing the entire html document.
>
> That's almost right. I originally wrote it that way to avoid having
> to ever buffer the entire text of the document. The document is
> indexed while it is parsed. But, as observed, this has lots of
> problems and was probably a bad idea.
>
> Could someone provide a patch that removes the multi-threading?
> We'd simply use a StringBuffer in HTMLParser.jj to collect the text.
> Calls to pipeOut.write() would be replaced with text.append().
> Then have the HTMLParser's constructor parse the page before
> returning, rather than spawn a thread, and getReader() would return
> a StringReader. The public API of HTMLParser need not change at all
> and lots of complex threading code would be thrown away. Anyone
> interested in coding this?

While we're on the subject...

When using the HTMLParser I tend to get a lot of token manager errors that basically kill the thread (usually an unexpected EOF). Even if we were to remove the multi-threading from the HTMLParser, these token manager errors would pretty much kill the calling app, because they are thrown as an Error rather than an Exception.

Any idea how to get around this? Perhaps this question really belongs on the javacc list?

Roy.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
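One possible way around the Error-vs-Exception problem is to catch the token manager error at the call site and rethrow it as a checked exception. The sketch below is only an illustration under assumptions: `TokenMgrError` here is a local stand-in for the JavaCC-generated class of the same name (which extends `java.lang.Error`), and `HtmlParseException`, `SafeParse`, and `ParseAction` are hypothetical names that do not exist in Lucene or JavaCC.

```java
import java.io.IOException;

// Stand-in for the JavaCC-generated TokenMgrError, which extends Error.
class TokenMgrError extends Error {
    TokenMgrError(String message) { super(message); }
}

// Hypothetical checked exception that callers can handle normally.
class HtmlParseException extends IOException {
    HtmlParseException(String message, Throwable cause) { super(message, cause); }
}

class SafeParse {
    // Hypothetical callback standing in for "run the generated parser".
    interface ParseAction { void run() throws Exception; }

    static void parse(ParseAction action) throws HtmlParseException {
        try {
            action.run();
        } catch (TokenMgrError e) {
            // Lexical errors arrive as an Error; translate them so the
            // calling application can recover instead of dying.
            throw new HtmlParseException("lexical error: " + e.getMessage(), e);
        } catch (Exception e) {
            throw new HtmlParseException("parse failed: " + e.getMessage(), e);
        }
    }
}
```

A caller would wrap each document in `SafeParse.parse(...)` and skip or log the document when `HtmlParseException` is thrown. Catching only `TokenMgrError`, rather than `Error` in general, leaves conditions such as `OutOfMemoryError` untouched.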
Re: demo HTML parser question
[EMAIL PROTECTED] wrote:
> We were originally attempting to use the demo html parser (Lucene 1.2), but as
> you know, it's for a demo. I think it's threaded to optimize on time, to allow
> the calling thread to grab the title or top message even though it's not done
> parsing the entire html document.

That's almost right. I originally wrote it that way to avoid having to ever buffer the entire text of the document. The document is indexed while it is parsed. But, as observed, this has lots of problems and was probably a bad idea.

Could someone provide a patch that removes the multi-threading? We'd simply use a StringBuffer in HTMLParser.jj to collect the text. Calls to pipeOut.write() would be replaced with text.append(). Then have the HTMLParser's constructor parse the page before returning, rather than spawn a thread, and getReader() would return a StringReader. The public API of HTMLParser need not change at all and lots of complex threading code would be thrown away. Anyone interested in coding this?

Doug
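The shape of the proposed change can be sketched in a few lines of Java. This is not the actual patch: `BufferedHtmlText` is a hypothetical stand-in for the generated HTMLParser, and its naive tag stripping merely stands in for the real JavaCC grammar. Only the structure follows the proposal: text is appended to a StringBuffer, the constructor parses everything before returning, and getReader() hands back a StringReader.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical stand-in for the demo HTMLParser; not the generated class.
class BufferedHtmlText {
    // Collected text; takes the place of the pipe fed by the parser thread.
    private final StringBuffer text = new StringBuffer();

    // Parse eagerly in the constructor, so no background thread is needed.
    BufferedHtmlText(Reader html) throws IOException {
        boolean inTag = false;
        int c;
        while ((c = html.read()) != -1) {
            if (c == '<') {
                inTag = true;
            } else if (c == '>') {
                inTag = false;
            } else if (!inTag) {
                text.append((char) c); // was: pipeOut.write(...)
            }
        }
    }

    // Same accessor name as in the proposal, now backed by the buffer.
    Reader getReader() {
        return new StringReader(text.toString());
    }
}
```

The trade-off is the one named in the message: the whole document text is buffered in memory, in exchange for removing the thread and the pipe.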
Re: demo HTML parser question
Hi Fred,

We were originally attempting to use the demo html parser (Lucene 1.2), but as you know, it's for a demo. I think it's threaded to optimize on time, to allow the calling thread to grab the title or top message even though it's not done parsing the entire html document. That's just a guess; I would love to hear from others about this. Anyway, since it is a separate thread, a token error could kill it and there is no way for the calling thread to know about it.

We had to create our own html parser, since we only cared about grabbing the entire text from the html document and we also wanted to avoid the extra thread. We also do a lot of "SKIP"ping to tolerate minor EOF errors (html documents in email almost never follow standards).

For your html needs, you might want to check out the other JavaCC HTML parsers on the JavaCC web site.

Roy.

On Wed, 22 Sep 2004 22:42:55 -0400, Fred Toth wrote
> Hi,
>
> I've been working with the HTML parser demo that comes with
> Lucene and I'm trying to understand why it's multi-threaded,
> and, more importantly, how to exit gracefully on errors.
>
> I've discovered if I throw an exception in the front-end static
> code (main(), etc.), the JVM hangs instead of exiting. Presumably
> this is because there are threads hanging around doing something.
> But I'm not sure what!
>
> Any pointers? I just want to exit gracefully on an error such as
> a required meta tag is missing or similar.
>
> Thanks,
>
> Fred
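On the hang itself: a JVM exits only once every non-daemon thread has finished, so an uncaught exception in main() does not end the process while a worker thread is still alive. The sketch below shows one way to avoid that, assuming (this is not confirmed in the thread) that the leftover parser thread is a blocked non-daemon thread. `DaemonWorkerDemo` and `startBlockedWorker` are hypothetical names; the worker simply waits forever to imitate a stuck parser thread.

```java
class DaemonWorkerDemo {
    // Start a worker that blocks forever, imitating a stuck parser thread.
    static Thread startBlockedWorker(boolean daemon) {
        final Object lock = new Object();
        Thread worker = new Thread(() -> {
            synchronized (lock) {
                try {
                    while (true) {
                        lock.wait(); // never notified
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "parser-worker");
        worker.setDaemon(daemon); // must be called before start()
        worker.start();
        return worker;
    }

    public static void main(String[] args) {
        // With daemon=true the JVM exits after the exception below;
        // with daemon=false the process would stay alive.
        startBlockedWorker(true);
        throw new IllegalStateException("required meta tag is missing");
    }
}
```

When the thread is created inside code you cannot change, catching the error in main() and calling System.exit() with a non-zero status is the blunter alternative.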
demo HTML parser question
Hi,

I've been working with the HTML parser demo that comes with Lucene and I'm trying to understand why it's multi-threaded, and, more importantly, how to exit gracefully on errors.

I've discovered if I throw an exception in the front-end static code (main(), etc.), the JVM hangs instead of exiting. Presumably this is because there are threads hanging around doing something. But I'm not sure what!

Any pointers? I just want to exit gracefully on an error such as a required meta tag is missing or similar.

Thanks,

Fred