Thanks for your quick reply Andy. I was half expecting it to be this sort of problem, but I was then puzzled that you can track the line and column number.
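To show what I mean about line and column: when the source is a String, I would have guessed that a line/column pair could be mapped back to a character offset roughly as below. This is only a sketch under my own assumptions, not anything taken from Neko or Xerces. The class and method names (LineColumnOffsets, toOffset) are made up for illustration. It assumes that line and column are both 1-based, that one column equals one Java char, and that "\n", "\r\n" and a lone "\r" each end a line. I have not checked whether the parser counts columns and line endings this way, or whether tabs are treated specially.

```java
public final class LineColumnOffsets {

    private LineColumnOffsets() {}

    /**
     * Converts a 1-based (line, column) pair into a 0-based char index
     * into the given source string.
     *
     * Assumptions (not verified against any parser): "\n", "\r\n" and a
     * lone "\r" each terminate a line, and one column is one Java char.
     */
    public static int toOffset(String source, int line, int column) {
        if (line < 1 || column < 1) {
            throw new IllegalArgumentException("line and column are 1-based");
        }
        int pos = 0;
        int currentLine = 1;
        // Skip over complete lines until we reach the start of the requested one.
        while (currentLine < line) {
            if (pos >= source.length()) {
                throw new IllegalArgumentException(
                        "source has fewer than " + line + " lines");
            }
            char c = source.charAt(pos++);
            if (c == '\r') {
                // Treat "\r\n" as a single line terminator.
                if (pos < source.length() && source.charAt(pos) == '\n') {
                    pos++;
                }
                currentLine++;
            } else if (c == '\n') {
                currentLine++;
            }
        }
        int offset = pos + column - 1;
        if (offset > source.length()) {
            throw new IllegalArgumentException("column is past the end of the source");
        }
        return offset;
    }

    public static void main(String[] args) {
        String html = "<html>\r\n<body>\n<p>hi</p>\n</body></html>";
        int offset = toOffset(html, 3, 1);
        // prints "15 -> <p>"
        System.out.println(offset + " -> " + html.substring(offset, offset + 3));
    }
}
```

If the parser's idea of a column differs from this (for example, if it expands tabs or normalises line endings before counting), the result would be off, which is why I would rather get the offset directly.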
My source is always going to be Unicode, as I will be parsing a String. The reason I want to find out the character positions is that I want to do the following things:

1. Insert some code at particular points, but it is imperative that the rest of the HTML remains EXACTLY the same. This is not possible using the Writer filter, as some properties, such as http://cyberneko.org/html/properties/names/elems, do not have a "no-change" option. Even if this option were present, there are still some changes made to the output. One example is that it inserts a <COLGROUP> element, and </col> tags in the wrong places. (Sorry, I haven't had time to report these bugs.) The point is that I don't want to have to worry about Neko being able to regenerate the original source verbatim; all I want is the character positions so I can insert the code myself.

2. I only want to parse the file once, but make some modifications at a later stage. I would prefer to persist only a few integer character positions rather than the whole document node tree.

Is there any way of doing this, assuming a Unicode character set?

By the way, congratulations and thanks for your efforts so far. I can see Neko being useful to me in future projects, but for my current problem I may have to use JTidy.

Also, I very nearly didn't find Neko during my search for HTML parsers. Two reasons for this:

- You have to dig pretty deep into the Xerces documentation to find out that it is capable of parsing HTML. Even the FAQ says it is not possible! I think you should advertise the fact on the front page.
- The Neko home page (http://www.apache.org/~andyc/neko/doc/html/index.html) does not contain any meta keywords or anything to make it easy for search engines to find it. Have you registered it with any search engines?

Something else I found to be quite ironic, considering one of the prime uses of Neko, is that its home page isn't even valid HTML!
I found it when Google came up with one of your posts on a mailing list archive. It would be a pity if people start using inferior products simply because they don't know Neko exists.

----- Original Message -----
From: "Andy Clark" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, August 21, 2002 4:20 AM
Subject: Re: Getting the position of a node in the input stream (using Neko)

> Martin Jericho wrote:
> > I want to parse an HTML document using Neko, and all I want to find out
> > is the character position of particular nodes in the input stream. I
>
> You have to make a distinction between "character position"
> and "byte offset" into the source file. They are not equivalent
> and can vary greatly depending on the character encoding of
> the file.
>
> > saw the XMLLocator interface, which I presume allows me to find out the
> > line number and the column within the line, but it doesn't include the
> > position from the start of the stream.
>
> This is because it is very difficult to map back to the
> original byte offset of the source file. Unless I wanted to
> re-implement all of the character decoders, that is...
>
> > Is there an example somewhere of doing this with Neko? I would really
> > appreciate it if someone could help me with this, as I have nearly spent
> > the whole day trying to figure it out from the source code.
>
> Because I use the standard Java character decoders, I have
> no way of knowing the original byte offsets that correspond
> to the resulting Unicode characters.
>
> Could you explain in more detail exactly what information
> you are trying to retrieve?
>
> --
> Andy Clark * [EMAIL PROTECTED]
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
