Re: org.apache.lucene.demo.IndexHTML - parse JSP files?

2003-03-25 Thread Tatu Saloranta
On Monday 24 March 2003 18:03, Michael Wechner wrote:
 John Bresnik wrote:
  anyone know of a quick and easy way to get this demo
  [org.apache.lucene.demo.IndexHTML] to parse JSP files as well? I used a
  crawler to create a local [static] version of the site [i.e. they are no
  longer JSP files, just the HTML output from the original JSP files], but in
  the interest of keeping the URL intact, I need to parse the files with the
  .jsp extension - the short question is, does anyone know of a way to *not*
  ignore the *.jsp files?

 just modify IndexHTML: there is one line in there which decides what
 extension it will index.

There is another question I was wondering about: since JSP is not XML (i.e.
it cannot be reliably parsed using an XML or even HTML parser [or, for that
matter, even with the simplest XML markup tokenizer that ignores nesting];
it needs a lower-level scanner), has anyone tried connecting an actual JSP
processor to Lucene? Or writing a simple one meant just for indexing,
without having to execute the embedded code?
[the problem with JSP compared to XML is that it need not nest properly with
the surrounding HTML content; one can use JSP inside attribute values, for
example; thus, the JSP first has to be processed to HTML, and then the HTML
needs to be further tokenized]

Jakarta has to have at least one such processor (I haven't looked at whether
there's a separate component or if Tomcat just has one embedded). Of course
parsing JSP is problematic in many ways, not just in getting the JSP tagging
out; dynamic portions probably just have to be ignored, and all the text
inside included (except for things inside comments).

-+ Tatu +-
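
A minimal sketch of the "simple one just meant for indexing" idea, assuming
it is acceptable to drop everything between "<%" and "%>" and keep only the
surrounding template text. The class and method names (JspStripper, strip)
are hypothetical and not part of Lucene or Tomcat. The scan is purely
textual: a "%>" inside a Java string literal in a scriptlet would end the
element early, and tag-style elements such as <jsp:include> are left for the
HTML parser to deal with.

```java
// Hypothetical helper for indexing only: drops everything between "<%" and
// "%>" (scriptlets, expressions, directives, and "<%-- --%>" comments) and
// keeps the surrounding template text. It never executes any JSP code.
class JspStripper {
    static String strip(String jsp) {
        StringBuilder out = new StringBuilder(jsp.length());
        int pos = 0;
        while (pos < jsp.length()) {
            int open = jsp.indexOf("<%", pos);
            if (open < 0) {
                // No more JSP elements: keep the rest unchanged.
                out.append(jsp, pos, jsp.length());
                break;
            }
            // Keep the template text that precedes the JSP element.
            out.append(jsp, pos, open);
            int close = jsp.indexOf("%>", open + 2);
            if (close < 0) {
                // Unterminated element: drop the remainder of the input.
                break;
            }
            pos = close + 2;
        }
        return out.toString();
    }
}
```

The returned string can then be passed to whatever HTML parser feeds the
index, which covers the attribute-value case: an expression embedded in an
href attribute is removed before the HTML is tokenized.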


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: org.apache.lucene.demo.IndexHTML - parse JSP files?

2003-03-25 Thread MMachado
Maybe you need to write, for example, one JSP for the search interface and
another JSP that takes the word you are searching for; this second JSP page
then hands it directly to a bean that uses Lucene to do the job.
Michel 
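
A rough sketch of the bean in that arrangement, with the Lucene part left
out on purpose. SearchBean and its backend function are hypothetical names;
the backend stands in for whatever code actually queries the index, and no
Lucene API is shown or implied. A bean created through <jsp:useBean> would
also need a public no-argument constructor; the constructor argument here
only keeps the sketch self-contained.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical bean: the results JSP sets the query from the request
// parameter and reads the hits back. The backend is a placeholder for the
// code that would query the Lucene index.
class SearchBean {
    private final Function<String, List<String>> backend;
    private String query = "";

    SearchBean(Function<String, List<String>> backend) {
        this.backend = backend;
    }

    public void setQuery(String query) {
        // Treat a missing parameter as an empty query.
        this.query = (query == null) ? "" : query.trim();
    }

    public String getQuery() {
        return query;
    }

    // Returns whatever the backend finds, or an empty list for a blank query.
    public List<String> getResults() {
        if (query.isEmpty()) {
            return Collections.emptyList();
        }
        return backend.apply(query);
    }
}
```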

-Original Message-
From: Tatu Saloranta [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 25, 2003 3:46 PM
To: Lucene Users List
Subject: Re: org.apache.lucene.demo.IndexHTML - parse JSP files?

On Monday 24 March 2003 18:03, Michael Wechner wrote:
 John Bresnik wrote:
  anyone know of a quick and easy way to get this demo
  [org.apache.lucene.demo.IndexHTML] to parse JSP files as well? I used a
  crawler to create a local [static] version of the site [i.e. they are no
  longer JSP files, just the HTML output from the original JSP files], but in
  the interest of keeping the URL intact, I need to parse the files with the
  .jsp extension - the short question is, does anyone know of a way to *not*
  ignore the *.jsp files?

 just modify IndexHTML: there is one line in there which decides what
 extension it will index.

There is another question I was wondering about: since JSP is not XML (i.e.
it cannot be reliably parsed using an XML or even HTML parser [or, for that
matter, even with the simplest XML markup tokenizer that ignores nesting];
it needs a lower-level scanner), has anyone tried connecting an actual JSP
processor to Lucene? Or writing a simple one meant just for indexing,
without having to execute the embedded code?
[the problem with JSP compared to XML is that it need not nest properly with
the surrounding HTML content; one can use JSP inside attribute values, for
example; thus, the JSP first has to be processed to HTML, and then the HTML
needs to be further tokenized]

Jakarta has to have at least one such processor (I haven't looked at whether
there's a separate component or if Tomcat just has one embedded). Of course
parsing JSP is problematic in many ways, not just in getting the JSP tagging
out; dynamic portions probably just have to be ignored, and all the text
inside included (except for things inside comments).

-+ Tatu +-





org.apache.lucene.demo.IndexHTML - parse JSP files?

2003-03-24 Thread John Bresnik
anyone know of a quick and easy way to get this demo
[org.apache.lucene.demo.IndexHTML] to parse JSP files as well? I used a
crawler to create a local [static] version of the site [i.e. they are no
longer JSP files, just the HTML output from the original JSP files], but in
the interest of keeping the URL intact, I need to parse the files with the
.jsp extension - the short question is, does anyone know of a way to *not*
ignore the *.jsp files?

thanks.






Re: org.apache.lucene.demo.IndexHTML - parse JSP files?

2003-03-24 Thread Michael Wechner
John Bresnik wrote:

 anyone know of a quick and easy way to get this demo
 [org.apache.lucene.demo.IndexHTML] to parse JSP files as well? I used a
 crawler to create a local [static] version of the site [i.e. they are no
 longer JSP files, just the HTML output from the original JSP files], but in
 the interest of keeping the URL intact, I need to parse the files with the
 .jsp extension - the short question is, does anyone know of a way to *not*
 ignore the *.jsp files?

just modify IndexHTML: there is one line in there which decides what
extension it will index.

HTH

Michael
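
To illustrate the kind of check being referred to, here is a sketch of an
extension test with ".jsp" added. IndexableFiles and isIndexable are
hypothetical names, and the list of extensions is an assumption; the actual
condition in IndexHTML may be written differently, so treat this as the
shape of the change rather than the demo's code.

```java
import java.util.Locale;

// Hypothetical helper, not part of the Lucene demo: decides by file
// extension whether a path should be handed to the HTML parser.
class IndexableFiles {
    // Extensions to index; ".jsp" is the addition being asked about.
    // The other entries are assumed, not copied from the demo source.
    private static final String[] EXTENSIONS = {".html", ".htm", ".txt", ".jsp"};

    static boolean isIndexable(String path) {
        // Compare case-insensitively so "INDEX.JSP" is accepted as well.
        String lower = path.toLowerCase(Locale.ROOT);
        for (String ext : EXTENSIONS) {
            if (lower.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }
}
```

Because the crawled copies are already plain HTML output, accepting the
extension is enough here; no JSP processing is needed for those files.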



Re: org.apache.lucene.demo.IndexHTML - parse JSP files?

2003-03-24 Thread John Bresnik
ah thanks.. I couldn't find the demo classes [turns out they were in a
different dir] - thanks.

- Original Message -
From: Michael Wechner [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, March 24, 2003 5:03 PM
Subject: Re: org.apache.lucene.demo.IndexHTML - parse JSP files?


 John Bresnik wrote:

  anyone know of a quick and easy way to get this demo
  [org.apache.lucene.demo.IndexHTML] to parse JSP files as well? I used a
  crawler to create a local [static] version of the site [i.e. they are no
  longer JSP files, just the HTML output from the original JSP files], but in
  the interest of keeping the URL intact, I need to parse the files with the
  .jsp extension - the short question is, does anyone know of a way to *not*
  ignore the *.jsp files?

 just modify IndexHTML: there is one line in there which decides what
 extension it will index.

 HTH

 Michael

 