Re: org.apache.lucene.demo.IndexHTML - parse JSP files?
On Monday 24 March 2003 18:03, Michael Wechner wrote:
> John Bresnik wrote:
>> Does anyone know of a quick and easy way to get this demo [org.apache.lucene.demo.IndexHTML] to parse JSP files as well? I used a crawler to create a local [static] version of the site [i.e. they are no longer JSP files, just the HTML output from the original JSP files], but in the interest of keeping the URLs intact, I need to parse the JSP extensions. The short question is: does anyone know of a way to *not* ignore the *.jsp files?
>
> Just modify IndexHTML: there is one line in there which decides what extension it will index.

There is another question I was wondering about. Since JSP is not XML (i.e. it cannot be reliably parsed using an XML or even HTML parser [or, for that matter, even with the simplest XML markup tokenizer that ignores nesting]; it needs a lower-level scanner), has anyone tried connecting an actual JSP processor to Lucene? Or writing a simple one just meant for indexing, without having to execute the embedded code?

[The problem with JSP compared to XML is that it need not nest properly with the HTML content around it; one can use JSP inside attribute values, for example. Thus the JSP first has to be processed to HTML, and then the HTML needs to be further tokenized.]

Jakarta has to have at least one such processor (I haven't looked at whether there's a separate component or whether Tomcat just has one embedded). Of course, parsing JSP is problematic in many ways, not just stripping out the JSP tagging; dynamic portions probably just have to be ignored, and all the text inside included (except for things inside comments).

-+ Tatu +-

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
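The "simple one just meant for indexing" idea above could look roughly like the following. This is only a sketch under stated assumptions, not a real JSP processor and not anything shipped with Lucene or Tomcat: the class name `JspTextScanner` and the method `stripJsp` are made up for illustration. It drops every `<% ... %>` construct (scriptlets, expressions, directives) and every `<%-- ... --%>` comment without executing anything, and leaves the surrounding HTML, which would still need to go through an HTML parser before indexing. It does not handle custom tags, `<jsp:...>` actions, or escaped sequences.

```java
// Hypothetical sketch: removes JSP constructs from a page without executing them,
// so that the remaining text can be handed to an HTML parser for indexing.
class JspTextScanner {

    // Returns the input with all "<% ... %>" and "<%-- ... --%>" spans removed.
    static String stripJsp(String source) {
        StringBuilder out = new StringBuilder(source.length());
        int pos = 0;
        while (pos < source.length()) {
            int open = source.indexOf("<%", pos);
            if (open < 0) {
                // No more JSP constructs: keep the rest of the page as-is.
                out.append(source, pos, source.length());
                break;
            }
            // Keep the template text that precedes the JSP construct.
            out.append(source, pos, open);

            // JSP comments close with "--%>"; every other construct closes with "%>".
            boolean comment = source.startsWith("<%--", open);
            String closer = comment ? "--%>" : "%>";
            int close = source.indexOf(closer, open + (comment ? 4 : 2));
            if (close < 0) {
                // Unterminated construct: drop everything after it.
                break;
            }
            pos = close + closer.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String page = "<a href=\"<%= url %>\">Home</a><%-- hidden --%><% if (x) { %>Hi<% } %>";
        System.out.println(stripJsp(page)); // prints <a href="">Home</a>Hi
    }
}
```

Because the scanner works on raw characters rather than on tags, it copes with JSP expressions inside attribute values, which is the case that breaks XML and HTML parsers. Text inside conditional blocks is always kept, matching the "ignore the dynamic portions, include all the text" approach described above.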
RE: org.apache.lucene.demo.IndexHTML - parse JSP files?
Maybe you need to write, for example, one JSP for the search interface and a second JSP that takes the word you are searching for and passes it directly to a bean that uses Lucene to do the job.

Michel

-----Original Message-----
From: Tatu Saloranta [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, March 25, 2003 3:46 PM
To: Lucene Users List
Subject: Re: org.apache.lucene.demo.IndexHTML - parse JSP files?
org.apache.lucene.demo.IndexHTML - parse JSP files?
Does anyone know of a quick and easy way to get this demo [org.apache.lucene.demo.IndexHTML] to parse JSP files as well? I used a crawler to create a local [static] version of the site [i.e. they are no longer JSP files, just the HTML output from the original JSP files], but in the interest of keeping the URLs intact, I need to parse the JSP extensions. The short question is: does anyone know of a way to *not* ignore the *.jsp files?

Thanks.
Re: org.apache.lucene.demo.IndexHTML - parse JSP files?
John Bresnik wrote:
> Does anyone know of a quick and easy way to get this demo [org.apache.lucene.demo.IndexHTML] to parse JSP files as well? I used a crawler to create a local [static] version of the site [i.e. they are no longer JSP files, just the HTML output from the original JSP files], but in the interest of keeping the URLs intact, I need to parse the JSP extensions. The short question is: does anyone know of a way to *not* ignore the *.jsp files?

Just modify IndexHTML: there is one line in there which decides what extension it will index.

HTH

Michael
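The line in question is not quoted in this thread, so the following is only a sketch of what such an extension check might look like once `.jsp` is added. It assumes the demo decides by file-name suffix; the class `IndexableExtensions` and the method `shouldIndex` are hypothetical names, not part of the Lucene demo.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the Lucene demo: decides which files get indexed.
class IndexableExtensions {

    // Suffixes to accept; ".jsp" is the addition discussed in this thread.
    private static final List<String> EXTENSIONS =
            Arrays.asList(".html", ".htm", ".txt", ".jsp");

    // Returns true if the file name ends with one of the accepted suffixes.
    static boolean shouldIndex(File file) {
        String name = file.getName().toLowerCase(Locale.ROOT);
        for (String ext : EXTENSIONS) {
            if (name.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(shouldIndex(new File("site/index.jsp"))); // prints true
        System.out.println(shouldIndex(new File("site/logo.gif")));  // prints false
    }
}
```

Since the crawled files here already contain plain HTML output and only carry the `.jsp` name, accepting the extra suffix should be enough for this case; real JSP source would need its JSP tagging removed first.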
Re: org.apache.lucene.demo.IndexHTML - parse JSP files?
Ah, thanks. I couldn't find the demo classes [turns out they were in a different dir]. Thanks.

----- Original Message -----
From: Michael Wechner [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, March 24, 2003 5:03 PM
Subject: Re: org.apache.lucene.demo.IndexHTML - parse JSP files?