Re: Ignoring Specific Tags with Digester

Simon Kitching Thu, 27 Jul 2006 14:53:26 -0700

On Thu, 2006-07-27 at 16:30 -0500, Paul J DeCoursey wrote:
> Simon Kitching wrote:
> > On Thu, 2006-07-27 at 09:59 -0400, rjn wrote:
> >   
> >> Hi Everyone,
> >>
> >> I'm trying to write a Syndication Feed parser using Digester, however
> >> I'm running into a stumbling block.  Many feeds have HTML in the
> >> entries such as <a>, <br>, etc.   Digester tries to parse these as XML
> >> tags, thus leading to blanks in the data I pull out.  I was wondering
> >> if there was a way to set Digester to ignore specific tags (in this
> >> case, the HTML tags)?
> >>     
> >
> > No. Digester uses a standard xml parser to parse its input. That means
> > the input *must* be valid xml. If the input you have to handle isn't
> > valid xml, then you can't use an xml parser to parse it.
> >
> > Perhaps you can use the NekoHTML parser to convert the input to valid
> > XML?
> >   http://java-source.net/open-source/html-parsers/nekohtml
> >
> >
> >   
> I don't think that was the question. I'm guessing the xml is valid, it's 
> just not dealing with the xhtml part of it correctly. I'm not too 
> familiar with Digester to know the solution however.


Ah. You might be right; the original poster did say HTML, but if the
input were not well-formed xml the result would be a failure to parse,
not "blanks in the data".

In that case, NodeCreateRule may be what you are looking for. It builds
a DOM node from the matched element and its children; you can then
reserialise that node back into text. Note that because the input is
being processed by a standard xml parser, and xml parsers have *no*
option for "don't parse the children of this node", neither does
Digester.
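
A minimal sketch of the reserialising step, using only the JDK's own
xml classes. The class and method names (NodeToText, innerXml,
firstElement) are made up for illustration, and firstElement is only a
stand-in for Digester here: in real use the Node would be the object
that NodeCreateRule pushes onto the Digester stack, not something you
parse yourself.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NodeToText {

    /** Serialise the children of a DOM node back into markup text. */
    public static String innerXml(Node node) {
        try {
            // A Transformer created without a stylesheet copies input to output.
            Transformer copy = TransformerFactory.newInstance().newTransformer();
            // Suppress the <?xml ...?> header, which would otherwise precede each child.
            copy.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            NodeList children = node.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                copy.transform(new DOMSource(children.item(i)), new StreamResult(out));
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialise node", e);
        }
    }

    /** Hypothetical stand-in for Digester: parse a string, return the first element with this name. */
    public static Node firstElement(String xml, String name) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getElementsByTagName(name).item(0);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<entry><description>See <a href=\"http://example.org/\">this</a>"
                + " page</description></entry>";
        // Prints the content of <description> with the nested <a> markup intact.
        System.out.println(innerXml(firstElement(xml, "description")));
    }
}
```

The nested tags are still parsed; they are just written out again
afterwards, so the input has to be well-formed for this to work.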

Regards,

Simon


