Thank you very much for the helpful reply; I'm back on track.
On Tue, Oct 20, 2009 at 2:01 AM, Andrzej Bialecki wrote:
> malcolm smith wrote:
>
>> I am looking to create a parser for a groupware product that would read
>> pages from a message-board-type web site (think phpBB). But rather than
>> cr
malcolm smith wrote:
I am looking to create a parser for a groupware product that would read
pages from a message-board-type web site (think phpBB). But rather than creating
a single Content item which is parsed and indexed to a single lucene
document, I am planning to have the parser create a master document (for the
orig
Oops... actually I meant to ask about an XHTML parser. Is it safe to use the HTML parser
to parse XHTML?
On 3/30/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>
> Rajesh Munavalli wrote:
> > Does anyone know where I can get the source code for the html parser which
> is in
>
Rajesh Munavalli wrote:
Does anyone know where I can get the source code for the html parser which is in
the plugins directory?
Which one? parse-html uses two parsers: one is called CyberNeko, the
other is called TagSoup. You can find their home pages and their sources
easily through Google
> > Has anyone experienced a problem with the way the
> > standard html parser plugin handles relative urls?
> >
> > There is a site where the home page is something
> like
> >
> > http://www.x.com/x.cgi
> >
> > and when browsing a link wi
I think Nutch is behaving correctly.
Maybe that page has a BASE URL (view source, look at the HEAD elements)
that throws off one or the other.
Otis
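
To illustrate the BASE point above, here is a minimal sketch in Python (standard library only, not Nutch code). It assumes a hypothetical page with and without a `<base href>` element; the helper names `BaseHrefFinder` and `resolve` and the `http://www.x.com/other/` base are made up for the example. It only shows how a BASE element changes the URL that a relative link resolves against.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseHrefFinder(HTMLParser):
    """Records the href of the first <base> element, if any."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so "BASE" and "base" both match.
        if tag == "base" and self.base_href is None:
            self.base_href = dict(attrs).get("href")


def resolve(page_url, html, href):
    """Resolve href against the page URL, honouring <base href> if present."""
    finder = BaseHrefFinder()
    finder.feed(html)
    # A <base href> replaces the page URL as the resolution base.
    base = urljoin(page_url, finder.base_href) if finder.base_href else page_url
    return urljoin(base, href)


page_url = "http://www.x.com/x.cgi"
plain = "<html><head></head><body></body></html>"
# Hypothetical BASE element, invented for this illustration.
with_base = '<html><head><base href="http://www.x.com/other/"></head></html>'

print(resolve(page_url, plain, "?paramname=paramvalue"))
print(resolve(page_url, with_base, "?paramname=paramvalue"))
```

If the two printed URLs differ for the real page, a BASE element is a likely reason for a link resolving somewhere unexpected.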
--- Raymond Creel <[EMAIL PROTECTED]> wrote:
> Has anyone experienced a problem with the way the
> standard html parser plugin hand
Has anyone experienced a problem with the way the
standard html parser plugin handles relative urls?
There is a site where the home page is something like
http://www.x.com/x.cgi
and when browsing a link with its href set to
'?paramname=paramvalue'
a browser will naturally t
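
For reference, a small sketch of how a query-only href is resolved under RFC 3986 rules, using Python's standard `urllib.parse.urljoin` rather than anything from Nutch. The base URL and href are the placeholders from the message above. This shows only the browser-style result and makes no claim about what the html parser plugin itself produces.

```python
from urllib.parse import urljoin

# Placeholder base URL and href taken from the message above.
base = "http://www.x.com/x.cgi"
href = "?paramname=paramvalue"

# A query-only reference keeps the base path and replaces only the query.
resolved = urljoin(base, href)
print(resolved)  # http://www.x.com/x.cgi?paramname=paramvalue
```

Comparing this value with the outlink the crawler records for the same link is a quick way to see whether the two resolve the href differently.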