Re: Getting the position of a node in the input stream (using Neko)

Andy Clark 21 Aug 2002 16:40:52 -0000

Martin Jericho wrote:

Thanks for your quick reply Andy.  I was half expecting it to be this sort
of problem, but I was then puzzled that you can track the line and column
number.


That's standard locator information provided by the SAX
interfaces. So we implement that in Xerces and I decided
to implement the same thing in NekoHTML. But in neither
case do we track "character offsets", which I think has
limited usefulness but others disagree.
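
For readers unfamiliar with the locator mechanism, here is a minimal sketch of how a SAX handler can read line and column numbers. It uses the XML parser bundled with the JDK rather than NekoHTML, and the class name `LocatorDemo` and its `locate` helper are made up for illustration. Note that SAX only promises an approximate position, normally the first character after the markup that triggered the event.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records where each start tag was reported.
class LocatorDemo extends DefaultHandler {
    private Locator locator;                       // supplied by the parser, may be null
    final List<String> positions = new ArrayList<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        // Called by the parser before startDocument, if it supports locators.
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (locator != null) {
            // Line and column are 1-based; there is no character offset in the interface.
            positions.add(qName + "@" + locator.getLineNumber() + ":" + locator.getColumnNumber());
        }
    }

    // Parses the given XML text and returns "name@line:column" for every start tag.
    static List<String> locate(String xml) throws Exception {
        LocatorDemo handler = new LocatorDemo();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.positions;
    }

    public static void main(String[] args) throws Exception {
        for (String p : locate("<root>\n  <child/>\n</root>")) {
            System.out.println(p);
        }
    }
}
```

The `Locator` interface exposes only `getLineNumber()`, `getColumnNumber()`, `getPublicId()` and `getSystemId()`, which is why a character offset has to be derived separately.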

1.  Insert some code at particular points, but it is imperative that the
rest of the html remains EXACTLY the same.  This is not possible using the
Writer filter, as some properties, such as
http://cyberneko.org/html/properties/names/elems, do not have a "no-change"
option.  Even if this option were present, there are still some changes made


Because "no-change" has the potential of producing XML
that is not well-formed. And the whole purpose of NekoHTML
is to parse HTML and make it appear as XML. Take
the following instance:

  <tAbLe> ... </TaBlE>

How do you handle this and still make it well formed
in an XML sense? NekoHTML lets you transform these to
uppercase, lowercase, or just to match the end tag w/
whatever the start tag is. The latter option will
produce the following:

  <tAbLe> ... </tAbLe>

to the output.  One example is that it inserts a <COLGROUP> element, and
</col> tags in the wrong places. (sorry I haven't had time to report these


Please let me know more detail about these bugs so
that I can fix them. Minimal sample files would be
preferable.

bugs).  The point is that I don't want to have to worry about Neko being
able to regenerate the original source verbatim, all I want is the character
positions so I can insert the code myself.


Is your HTML string generated? Or serialized into a
String object?

Is there any way of doing this, assuming a unicode character set?


Not unless you've stored the offsets for each separate
line within the document String. Then you could use the
line/column information.
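
The idea above can be sketched roughly as follows, assuming the whole document is held in one String. The class `LineOffsetIndex` and its `toOffset` method are hypothetical names, not part of NekoHTML or Xerces. The sketch also assumes the parser counts lines and columns the same way (1-based, one column per char, with `\n`, `\r` and `\r\n` each ending a line), which should be checked against real input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: maps 1-based line/column pairs to 0-based character offsets.
class LineOffsetIndex {
    private final int[] lineStarts;   // lineStarts[i] = offset of the first char of line i+1

    LineOffsetIndex(String text) {
        List<Integer> starts = new ArrayList<>();
        starts.add(0);                               // line 1 always starts at offset 0
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                i++;                                 // treat "\r\n" as a single line break
                starts.add(i + 1);
            } else if (c == '\n' || c == '\r') {
                starts.add(i + 1);
            }
        }
        lineStarts = new int[starts.size()];
        for (int i = 0; i < lineStarts.length; i++) {
            lineStarts[i] = starts.get(i);
        }
    }

    // Converts a 1-based (line, column) pair into a 0-based offset into the text.
    int toOffset(int line, int column) {
        if (line < 1 || line > lineStarts.length || column < 1) {
            throw new IllegalArgumentException("bad position " + line + ":" + column);
        }
        return lineStarts[line - 1] + (column - 1);
    }

    public static void main(String[] args) {
        String html = "<html>\n<body>\r\n<p>hi</p>";
        LineOffsetIndex index = new LineOffsetIndex(html);
        int offset = index.toOffset(3, 1);           // first char of the third line
        System.out.println(html.substring(offset));
    }
}
```

Since the locator usually reports the position just past the markup for an event, the resulting offset tends to point at the end of a tag, so inserting text before a tag would still need a backwards search for its opening `<`.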

By the way, congratulations and thanks for your efforts so far.  I can see
Neko being useful to me in future projects, but for my current problem I may
have to use JTidy.


By all means. JTidy is a really nice tool.

Also, I very nearly didn't find Neko during my search for HTML parsers.  Two
reasons for this:
- You have to dig pretty deep into the Xerces documentation to find out that
it is capable of parsing HTML.  Even the FAQ says it is not possible!  I


Xerces is *not* capable of parsing HTML because HTML is
not XML. However, we do mention that an HTML parser
configuration is possible with the XNI framework. And
perhaps NekoHTML will make it into the Xerces download
or at least as a kind of sub-project that's explicitly
mentioned in the Xerces pages.

- The Neko home page (http://www.apache.org/~andyc/neko/doc/html/index.html)
does not contain any meta keywords or anything to make it easy for search
engines to find it. Have you registered it with any search engines?


Not really. It's rather quiet 'cause I concentrate on
Xerces developers. However, I do announce on Freshmeat
so that makes a lot of people aware of its existence.

Something else I found to be quite ironic, considering one of the prime uses
of Neko, is that its homepage isn't even valid HTML!


Ummm... that's sort of the point. To show that NekoHTML
can even parse its own sloppy documentation. :)

I found it when Google came up with one of your posts on a mailing list
archive.  It would be a pity if people start using inferior products simply
because they don't know Neko exists.


True. I should work harder to let people know that it
is available.

Thanks for the insight and I'd really like to hear more
about the bugs you're experiencing. Type at you later...

--
Andy Clark * [EMAIL PROTECTED]


