Thanks for your quick reply Andy.  I was half expecting it to be this sort
of problem, but I was then puzzled that you can track the line and column
number.

My source will always be Unicode, as I will be parsing a String.  I want to
find out the character positions so that I can do the following things:

1.  Insert some code at particular points, but it is imperative that the
rest of the HTML remains EXACTLY the same.  This is not possible using the
Writer filter, as some properties, such as
http://cyberneko.org/html/properties/names/elems, do not have a "no-change"
option.  Even if this option were present, the output would still differ
from the input.  For example, the filter inserts a <COLGROUP> element, and
puts </col> tags in the wrong places.  (Sorry, I haven't had time to report
these bugs.)  The point is that I don't want to have to rely on Neko being
able to regenerate the original source verbatim; all I want is the character
positions so I can insert the code myself.

2.  I only want to parse the file once, but make some modifications at a
later stage.  I would prefer to persist only a few integer character
positions rather than the whole document node tree.
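
To make points 1 and 2 concrete, here is a rough sketch of the insertion
step I have in mind, assuming I already had the character offsets.  The
class and method names (OffsetInserter, insertAll) are made up for
illustration and have nothing to do with Neko's API.  The offsets refer to
the unmodified source, so the snippets are applied from the highest offset
down and the remaining offsets stay valid.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Neko or Xerces.
public class OffsetInserter {

    /**
     * Inserts each snippet at its character offset in the original source.
     * Every offset is a 0-based index into the unmodified string.
     */
    public static String insertAll(String source, Map<Integer, String> insertions) {
        // Sort the offsets, then walk them in descending order so that an
        // insertion never shifts an offset that has not been processed yet.
        TreeMap<Integer, String> sorted = new TreeMap<>(insertions);
        StringBuilder out = new StringBuilder(source);
        for (Map.Entry<Integer, String> e : sorted.descendingMap().entrySet()) {
            out.insert(e.getKey(), e.getValue());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>x</td></tr></table>";
        Map<Integer, String> insertions = new TreeMap<>();
        // Here the offsets are found with indexOf; in practice they would
        // be the persisted integers from the single parse.
        insertions.put(html.indexOf("<tr>"), "<!-- before row -->");
        insertions.put(html.indexOf("</table>"), "<!-- after rows -->");
        System.out.println(insertAll(html, insertions));
    }
}
```

Everything outside the inserted snippets is copied through untouched, which
is the property I cannot get from the Writer filter.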

Is there any way of doing this, assuming a Unicode character set?
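
Since the whole source is already a String in memory, one workaround I can
imagine is converting the line and column from XMLLocator
(getLineNumber() and getColumnNumber()) into a character offset myself.
The sketch below is only that, and it rests on assumptions I have not
verified against Neko: that both numbers are 1-based, that the column is
counted in Java chars (a tab counts as one), and that "\n", "\r\n" and a
lone "\r" each count as one line break.  The class name is hypothetical.

```java
// Hypothetical helper, not part of Neko or Xerces.
public class LineColumnToOffset {

    /**
     * Converts a 1-based line and column into a 0-based index into source.
     * Returns -1 if the line does not exist or the result would lie past
     * the end of the string. The column is not checked against the length
     * of its line.
     */
    public static int toOffset(String source, int line, int column) {
        int currentLine = 1;
        int i = 0;
        // Skip forward until i points at the first char of the wanted line.
        while (currentLine < line && i < source.length()) {
            char c = source.charAt(i++);
            if (c == '\r') {
                // Treat "\r\n" as a single line break.
                if (i < source.length() && source.charAt(i) == '\n') {
                    i++;
                }
                currentLine++;
            } else if (c == '\n') {
                currentLine++;
            }
        }
        if (currentLine < line) {
            return -1; // the source has fewer lines than requested
        }
        int offset = i + column - 1;
        return offset <= source.length() ? offset : -1;
    }

    public static void main(String[] args) {
        String html = "<html>\r\n<body>\n<p>hi</p>";
        int offset = toOffset(html, 3, 4);
        System.out.println(offset + " -> " + html.charAt(offset));
    }
}
```

Whether the locator reports the position of the start of a tag or the
position just after it at the time of each callback is something I would
still need to find out.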

By the way, congratulations and thanks for your efforts so far.  I can see
Neko being useful to me in future projects, but for my current problem I may
have to use JTidy.

Also, I very nearly didn't find Neko during my search for HTML parsers.  Two
reasons for this:
- You have to dig pretty deep into the Xerces documentation to find out that
it is capable of parsing HTML.  Even the FAQ says it is not possible!  I
think you should advertise the fact on the front page.
- The Neko home page (http://www.apache.org/~andyc/neko/doc/html/index.html)
does not contain any meta keywords or anything to make it easy for search
engines to find it. Have you registered it with any search engines?
Something else I found quite ironic, considering one of the prime uses
of Neko, is that its homepage isn't even valid HTML!

I found it when Google came up with one of your posts on a mailing list
archive.  It would be a pity if people start using inferior products simply
because they don't know Neko exists.


----- Original Message -----
From: "Andy Clark" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, August 21, 2002 4:20 AM
Subject: Re: Getting the position of a node in the input stream (using Neko)


> Martin Jericho wrote:
> > I want to parse an HTML document using Neko, and all I want to find out
> > is the character position of particular nodes in the input stream.  I
>
> You have to make a distinction between "character position"
> and "byte offset" into the source file. They are not equivalent
> and can vary greatly depending on the character encoding of
> the file.
>
> > saw the XMLLocator interface, which I presume allows me to find out the
> > line number and the column within the line, but it doesn't include the
> > position from the start of the stream.
>
> This is because it is very difficult to map back to the
> original byte offset of the source file. Unless I wanted to
> re-implement all of the character decoders, that is...
>
> > Is there an example somewhere of doing this with Neko?  I would really
> > appreciate it if someone could help me with this, as I have nearly spent
> > the whole day trying to figure it out from the source code.
>
> Because I use the standard Java character decoders, I have
> no way of knowing the original byte offsets that correspond
> to the resulting Unicode characters.
>
> Could you explain in more detail exactly what information
> you are trying to retrieve?
>
>
> --
> Andy Clark * [EMAIL PROTECTED]
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
